Introduction

Over the last decade, next-generation sequencing (NGS) technologies have brought immense improvements in plant genome sequencing throughput and reduced the cost, and many plant genomes have been sequenced using this technology, such as F. vesca (Shulaev et al. 2011), C. cajan (Varshney et al. 2012), G. raimondii (Wang et al. 2012; Paterson et al. 2012), G. arboreum (Li et al. 2014) and G. hirsutum (Li et al. 2015; Zhang et al. 2018). However, de novo assembly of large eukaryotic genomes has remained a great challenge with the NGS platform because of a significant amount of repeat contents. As a result, de novo assembly of chromosome-scale scaffolds has become a major constraint to the completion of a high-quality genome sequence. Compared with traditional methods, like genetic map and physical map, high-throughput chromosome conformation capture (Hi-C) technology (Belton et al. 2012) has assisted in the assembly of long scaffolds to produce chromosome-scale genome assemblies (Lightfoot et al. 2017). The Hi-C technique is based on restriction enzyme cutting sites, which are more evenly distributed and have much higher density than genetic map.

The Hi-C-based proximity-guided assembly was initially developed to study the three-dimensional (3D) conformation of chromosomes of yeast gene expression (Berkum et al. 2010). The working principle based on two regular models with Hi-C data: the first one is that the rate of Hi-C interaction is inversely proportional to the genomic distance between the pairs of loci; the second one is that the rate of Hi-C interaction of pairs of loci within a chromosome is significantly higher than that in different chromosomes (Xie et al. 2015). Based on these two models, Hi-C-based proximity-guided assembly was applied for de novo assembling of human beings, and subsequently for the assembling of mouse and Drosophila genomes which reported good results or improvement (Burton et al. 2013). With the success of testing and verifying this method in Arabidopsis thaliana (Xie et al. 2015), Hi-C based proximity-guide assembly has been reported as an effective and efficient method which subsequently has been used in many other plants.

Besides its commercial value, cotton also serves as a perfect model system for studying cell wall biosynthesis (Zhang et al. 2018), cell elongation (Guo et al. 2017), and polyploidization (Yuan et al. 2016). The Gossypium genus comprises more than 50 species including at least 5 tetraploid species and 45 diploid species. Diploid cotton species are divided into 8 sub-genomes, denoted A-G and K based on chromosome pairing relationships (Wendel 1992). Tetraploid cotton species, such as cultivated G. hirsutum (AD1) and G. barbadense (AD2), had formed by an allopolyploidy event about 1–2 million years ago (Paterson et al. 2012). These tetraploid cotton species share common ancestors with the modern New World species G. raimondii (D5) and the Old World A-genome species G. herbaceum (A1) or G. arboreum (A2). Previously, the genomes of different cotton species were sequenced and assembled including G. raimondii (Wang et al. 2012), G. arboreum (Li et al. 2014), and G. hirsutum (Li et al. 2015), respectively.

Among these cotton species, the genome of G. raimondii has the lowest complexity which has been sequenced and assembled using the next-generation Illumina paired-end sequencing strategy (Wang et al. 2012). Approximately 73% (281 scaffolds) of the assembled sequences were anchored to 13 chromosomes, covering 88.1% of the genome, while only 52.4% (228 scaffolds) of the total sequences were both ordered and oriented. The completeness and accuracy of the previous sequenced and assembled genome of G. raimondii (Wang et al. 2012) were relatively low due to the large number of repeat elements and the small number of genetic markers. In the present study, we conducted a de novo Hi-C sequencing of G. raimondii, and incorporated the new Hi-C data with the existing G. raimondii scaffolds (Wang et al. 2012) to improve the quality of the D-genome.

Methods

Tissue collection and Hi-C sequencing

Plant materials

The seeds of G. raimondii D5–1 were planted in an incubator at constant environmental conditions having 27 °C temperature, 60% relative humidity, 16/8 h light/dark photoperiod, and 100% fluorescent light. When the sixth euphylla came out, these seedlings were transplanted into big pots. Approximately 3 g young leaves from G. raimondii plants were collected and immediately treated with formaldehyde.

Hi-C pipeline

During this study, we have used the same Hi-C pipeline as in Arabidopsis thaliana (Xie et al. 2015). Before starting this experiment, we have tested the integrity of DNA from the formaldehyde-treated tissue, and then the DNA samples were isolated and digested by MboI instead of HindIII because of the shorter recognition site (only four bases of MboI). The resulting sticky ends were filled with nucleotides in which cytosine is biotinylated, and ligated the adjacent blunt ends to a chimeric circle under extremely dilute conditions. Subsequently, DNA was purified and broken into 300–500 base pairs using ultrasonic, pull-down the biotin-labeled DNA and performed the PCR reaction (10 cycles). After DNA purification, the finished Hi-C library was sequenced with an Illumina Hiseq (PE150). A total of 570 412 361 read-pairs were obtained.

Genome assembly based on Hi-C data

Assembling of G. raimondii genome involved three steps. First, valid Hi-C paired-end reads and contact matrix with a resolution of 100 kb were generated by HiC-Pro (Servant et al. 2015). The raw sequence data with low quality, unmapped and invalid mapped paired reads were filtered out by HiC-Pro and contact-matrix based on interaction frequency was created. HiC-Pro results showed that 95.6% of sequences were clean Q30 bases, showing a good quality of sequence data. After filtering out the Hi-C data, 81.95% of uniquely mapped sequences were valid paired-end reads. Thus, the valid paired-end reads (223304666) were used for further genome assembly. At the second step, Errors in scaffolds of the initial draft assembly were identified and corrected following the Aedes aegypti’s de novo assembly procedure (Dudchenko et al. 2017). Briefly, the errors were corrected by identifying the bins where a scaffold’s long-range contact pattern changes abruptly, which was unlikely a correct scaffold. We cut out the error bins as a new scaffold. There are 259 errors within scaffolds. At the third step, the G. raimondii genome was assembled with the Hi-C data by Lachesis (Burton et al. 2013), which contained clustering, ordering, and orienting. Finally, the assembled G. raimondii genome was assessed by heat-map and collinear analysis.

Repeats and gene annotation

Repeat sequences of G. raimondii genome were masked by RepeatMasker (v4.0.8) with a custom library generated from RepeatModeler v2.0.1 (Flynn et al. 2020). Repeat-masked sequences were obtained as a draft genome in gene prediction. de novo prediction, homology-based prediction and transcriptome-based prediction were combined to annotate the draft genome of G. raimondii. GlimmerHMM v3.0.4 (Chrysanthou et al. 2011) and AUGUSTUS v3.3.2 (Stanke et al. 2006) were run for de novo prediction. We used 9 transcriptomes (SRR389181, SRR203240, SRR203250, SRR8878792, SRR8878720, SRR8878562, SRR8878553, SRR8878556, and SRR8878559) of G.raimondii from different organs and growth stages to predict genes. Homologous proteins from Arabidopsis, maize, and rice were input into GenomeThreader (v1.6.1) to train models for homology-based prediction. Sequences of transcriptome samples were aligned to draft genome by HISAT2 v2.1.0 (Kim et al. 2019) and transcripts were assembled by StringTie v1.3.6 (Pertea et al. 2015). All files in general feature format were integrated into a final genome annotation file by EvidenceModeler v1.1.1 (Haas et al. 2008).

The reassembled genome was evaluated in three aspects: completeness, consistency and continuity. We evaluated the continuity according to the length of scaffold N50 calculated by our python script. The completeness was checked by BUSCO (v4.1.4; Simão et al. 2015) and the consistency was evaluated by comparative analysis between our reassembled genome and other published genomes of G. raimondii. Genome alignment was performed by Minimap2 v2.1 (Li, 2018) and dot plots were generated by dotPlotly (https://github.com/tpoorten/dotPlotly/). Orthologous genes of G. raimondii genomes were detected by Orthofinder v2.5.1 (Emms and Kelly, 2015), and the visualization of orthologous gene detection was implemented by jcvi (https://github.com/tanghaibao/jcvi). All parameters of the softwares in this article were set as default.

Results

Assembling results

The genome of G. raimondii was reassembled using Hi-C data. About 98.42% of total sequence length were clustered successfully, among which 99.72% and 97.37% of the clustered sequences were ordered and high-quality ordered, respectively. And approximately 99.92% of the ordered sequences were oriented with high-quality. The statistics of pseudo-chromosome length are shown in Table 1, while the indicator statistics of the initial draft genome and reassembly results are shown in Table 2. From the parameters like scaffolds number, N50 of previously draft genome and reassemble genome, we found that the G. raimondii genome using a Hi-C-based proximity-guided assembly is much better than the reported draft genome (Wang et al. 2012).

Table 1 The statistics of pseudo-chromosome length
Table 2 The statistics of the draft genome and the reassembled G. raimondii genome

Genome assessment results

The quality of the reassembled genome can be verified in four aspects. First, BUSCO analysis reported that 97.2% (> 90%) completeness scores for the reassembled genome, indicating that it was complete. Compared with ~ 32 000 genes in the old draft genome (besides scaffolds), 38 975 genes were found in the annotation of the reassembled genome. Second, The heat-map directly proved the validity of the processing methods. Based on the two regular models as we mentioned above, the heat-map of the reassembled results revealed that the diagonal interaction was much higher. The boundaries of each pseudo-chromosome are relatively clear and it is under low background noise that shows good reassembling results (Fig. 1). Third, we compared the draft genome and the reassembled genome with two previous genome assemblies of G. raimondii (Paterson et al. 2012; Udall et al. 2019) (Fig. 2). The reassemble genome had a general agreement in their alignments with two previous genome assemblies than the draft genome. Fourth, we compared homologous genes in the draft genome and reassembled genome with which in two previous genome assemblies of G. raimondii (Paterson et al. 2012; Udall et al. 2019) (Fig. 3). Both the draft genome and reassembled genome showed a good agreement with other two versions, but the sizes of chromosomes (like chromosome 4 and 8) in the reassembled genome were more reliable than the draft genome.

Fig. 1
figure 1

The heat-map of the reassembled result at a 100-kb resolution. Hi-C interactions were detected in the reassembled G.raimondii genome sequences, and most interactions maps of chromosome sequences suggested that the genome sequence was assembled in the correct order. The darker red dot is representing the higher interaction between bins

Fig. 2
figure 2

Genome comparisons between the draft genome and reassembled genome with two previous genome assemblies of G. raimondii. A) Genome alignment between G. raimondii (Wang et al. 2012) and G. raimondii (Paterson et al. 2012). B) Genome alignment between G. raimondii (Wang et al. 2012) and G. raimondii (Udall et al. 2019). C) Genome alignment between G. raimondii (new) and G. raimondii (Paterson et al. 2012). D) Genome alignment between G. raimondii (new) and G. raimondii (Udall et al. 2019)

Fig. 3
figure 3

The collinearity analysis of homologous genes between the reassemble genome, the draft genome (Wang et al. 2012) and reassembled genome (new) with two previous genome assemblies of G. raimondii (Paterson et al. 2012; Udall et al. 2019)

Discussion

In the present study, we applied de novo Hi-C sequencing of the G. raimondii genome to improve the quality and accuracy of its previously reported draft genome from Wang et al. (2012). Results from the comparative analysis of different parameters between the draft genome (Wang et al. 2012) and the current reassembled genome showed significantly improved quality as compared with the previous one. The reassembled genome is not good enough, because it is based on the next generation sequencing in 2012. The G. raimondii genome (Udall et al. 2019) was assembled to a chromosome level using PacBio long-read technology, but Hi-C and Bionano optical mapping may have a better assembly. But diffirent genome versions can provide different information, like some genes can only be annotated in a certain version of annotation file.