Background

Structural variations (SVs) are genomic variations involving segments of DNA from 50 bases to several megabases. SVs consist of unbalanced rearrangements such as copy number variations (CNV) including insertions/deletions (Indels) and presence/absence variations (PAV), and balanced events like inversions and translocations [1,2,3,4]. Several mechanisms explain SV formation, such as recombination errors generated by non-homologous end-joining and non-allelic homologous recombination, genome duplication and transposition [1, 2]. Structural variations in humans have been studied extensively, and Ho et al. reviewed the impact of SVs on human diseases [4]. In plants, it has been shown that SVs play a key role in the evolution of genomes and are responsible for phenotypic variations by affecting Transposable Elements (TEs) and genes [3, 5,6,7,8]. In particular, SVs were found in stress-related and resistance genes [9,10,11,12,13], leading to local adaptation [14, 15], or linked to other traits of agronomical interest such as tomato fruit flavor, rice grain size or poplar wood formation [16,17,18].

Nowadays, the identification of SVs contributes to the construction of the pangenome reference sequence or super pangenome [19, 20]. This new approach to build a reference will better reflect the genetic diversity of the species, and at the same time expand the understanding of genome evolution, as well as enhance the knowledge on adaptive traits [21,22,23,24,25].

The development of new sequencing technologies has boosted studies of SVs present in a genome, which were detected until recently only by Comparative Genomic Hybridization (CGH) arrays or single nucleotide polymorphism (SNP) arrays [26,27,28,29]. The 3rd generation sequencing offers new opportunities to identify SVs at a larger scale with two approaches. One approach is based on linked short reads, as in 10x Genomics and Hi-C approaches [30], and the second on generating long reads, as proposed by Pacific Biosciences [31] and Oxford Nanopore Technologies (ONT) [32, 33]. These approaches provide access to complex regions, which has increased their use to improve genome assemblies and to detect structural variations in human [4, 34,35,36,37], in Arabidopsis thaliana ecotypes [24, 38, 39] and T-DNA insertion lines [40, 41] and in other plants [42,43,44]. In parallel, a technology based on physical maps, developed by Bionano Genomics [45], generates information on very large DNA molecules. These maps, named optical maps, are frequently generated to improve and validate sequence assemblies, and to detect SVs in animal genomes [36, 46,47,48,49] and more recently in plants [7, 42, 43, 50]. These 3rd generation technologies, alone or in combination, have made it possible to identify genetic rearrangements between individuals at the intraspecific level [50, 51].

Comparisons between sequencing technologies or SV detection software are no longer uncharted territory [24, 36, 38, 52]. However, the comparison of ONT and Bionano has only been performed in animals (Chimpanzee [49] and Drosophila [53]), and not yet in plants. Here, we investigated the genomes of two of the most studied ecotypes of A. thaliana (Col-0 and Ler-1), using both ONT sequencing and Bionano optical maps, to compare the advantages of these two fundamentally different technologies, one sequencing-based and one physical map-based, for the detection and characterization of SVs in plants.

Results

ONT sequencing and genome assembly

The ONT sequences of Arabidopsis thaliana ecotypes, Columbia (Col-0), here named as Evry.Col-0 and Landsberg erecta 1 (Ler-1), here named as Evry.Ler-1, were cleaned using the correction and trimming steps of Canu assembler [54]. A total of 9.8 Gb (N50 = 12.7 kb, 75X coverage) and 6.1 Gb (N50 = 16.5 kb, 47X coverage) were obtained for Evry.Col-0 and Evry.Ler-1, respectively (Additional file 1: Tables S1 and S2).

Cleaned Evry.Ler-1 ONT reads were aligned against the Ler reference genome with Minimap2 to estimate ONT data completeness [38, 55]. In total, 98.9% of the Ler reference genome was covered by the ONT Evry.Ler-1 reads. The cleaned Evry.Ler-1 reads were also mapped against the Col-0 TAIR10.1 reference genome achieving 95.2% of total genome coverage (Additional file 1: Table S3) [56]. Samtools depth tool [57] was then used on the Evry.Ler-1 ONT reads mapping against the Col-0 TAIR10.1 reference genome to estimate the coverage at each position. The average coverage of 100 kb windows was 46.9X, with depth fluctuations in centromeric regions (Fig. 1).
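The per-window averaging of read depth described above can be sketched as follows. This is a minimal illustration and not the authors' script: it assumes tab-separated lines in the usual `samtools depth` layout (chromosome, 1-based position, depth), and the function name `window_mean_depth` is hypothetical. The mean is taken over the positions actually reported, so zero-depth positions only count if they are present in the input (for example when `samtools depth -a` is used).

```python
from collections import defaultdict

WINDOW = 100_000  # 100 kb bins, as in the text


def window_mean_depth(lines, window=WINDOW):
    """Return {(chrom, window_start): mean depth} from samtools-depth-like lines.

    Assumption: each line is "chrom<TAB>pos<TAB>depth" with 1-based positions.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chrom, pos, depth = line.split("\t")[:3]
        # Convert the 1-based position to the 0-based start of its window.
        start = (int(pos) - 1) // window * window
        sums[(chrom, start)] += int(depth)
        counts[(chrom, start)] += 1
    # Mean over reported positions only (see the note on zero-depth positions).
    return {key: sums[key] / counts[key] for key in sums}


# Toy input: two positions in the first window, one in the second.
example = ["Chr1\t1\t40", "Chr1\t2\t50", "Chr1\t100001\t60"]
means = window_mean_depth(example)
```

On real data the input would be the lines of the depth file rather than a toy list; the genome-wide figure quoted in the text would then be the average of these per-window values.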

Fig. 1

Circos visualization of the Evry.Ler-1 SV landscape. All comparisons were performed against the Col-0 TAIR10.1 reference genome per 100 kb bins. From external to internal layer (Circle1 to Circle7): Circle1: Col-0 TAIR10.1 chromosomes (ticks every 100 kb): black and light grey rectangles represent centromeric and NOR regions respectively; Circle2: Average mapping coverage for Evry.Col-0 ONT reads (grey line) and Evry.Ler-1 ONT reads (orange line) with dark orange if coverage > 46X; Circle3: DLE-1 label density as purple line (dark purple if density > 18 labels per 100 kb); Circle4: Gene density as green line (dark green if density > 23), NLR genes [58] indicated as green rectangles; Circle5: TE density as blue line (dark blue if density > 58); Circle6: ONT SV occurrences as orange outward bars (dark orange bars represent ONT-specific SVs); Circle7: Bionano SV occurrences as purple inward bars (dark purple bars represent Bionano-specific SVs)

To select the assembler best suited to our data, de novo assemblies for Evry.Col-0 and Evry.Ler-1 were performed with Canu [54], RA [59] and SMARTdenovo (SDN, [60]). Based on general statistics (assembly size, contig number, N50 size), SMARTdenovo generated better assemblies for both ecotypes than Canu or RA (Additional file 1: Tables S4 and S5). Indeed, the SDN assemblies resulted in 79 contigs for Evry.Col-0 (cumulative size = 117 Mb, N50 = 12.5 Mb with L50 = 5 contigs) and 101 contigs for Evry.Ler-1 (cumulative size = 117 Mb, N50 = 10.7 Mb with L50 = 5 contigs). Assemblies using RA were more fragmented, and chimeric contigs were identified in the Canu assemblies after MUMmer alignments on the reference chromosomes (Additional file 2: Figs. S1A-C and S2A-C). For all assemblers tested, centromeric regions were covered by many small contigs. These results were also supported by the alignments of the Evry.Col-0 and Evry.Ler-1 assemblies on the respective reference chromosomes Col-0 TAIR10.1 and Ler. The SDN assemblies were used to perform the subsequent SV analyses.
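For readers unfamiliar with the contiguity statistics used to compare assemblers, the sketch below shows how N50 and L50 are conventionally derived from a list of contig lengths. The function name `n50_l50` and the toy lengths are illustrative assumptions, not values from this study.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running sum of the
    length-sorted contigs first reaches half of the assembly size;
    L50 is the number of contigs needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("empty list of contig lengths")


# Toy assembly of five contigs (arbitrary units).
toy = [50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy)
```

With the toy lengths, the two largest contigs (50 + 40) already cover half of the total of 150, so N50 is 40 and L50 is 2. An L50 of 5, as reported for both SDN assemblies, means that five contigs cover at least half of the assembled sequence.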

Optical maps generation

Genomic DNA was labeled with the DLE-1 enzyme according to the manufacturer’s staining protocol. One run per ecotype was performed on the Saphyr device, resulting in 577.5 Gb and 610.9 Gb of molecules for Evry.Col-0 and Evry.Ler-1 respectively. Molecules larger than 150 kb were selected, leading to about 600-fold final coverage based on the theoretical 130 Mb Arabidopsis genome size (Additional file 1: Tables S6 and S7). A total of 17 and 14 optical maps with N50 of 14.6 Mb and 14.7 Mb were generated for Evry.Col-0 and Evry.Ler-1 respectively, leading to a genome size of 125 Mb for both ecotypes (Additional file 1: Tables S8 and S9).

The average label density of the Evry.Ler-1 optical maps was estimated at 18.47 per 100 kb (Additional file 1: Table S7). However, the DLE-1 density decreases in the centromeric regions due to molecule depth diminution and optical map breaks (Fig. 1, Additional file 2: Fig. S3A-E).

Structural variations detection

Structural variation detection was performed independently on the ONT and Bionano data and was carried out in two ways: 1) Evry.Ler-1 versus the Col-0 TAIR10.1 reference genome and 2) Evry.Col-0 versus the Ler reference genome. The different types of structural variations detected in our study are described in Additional file 2: Fig. S4. We observed that the general SV characteristics (number, types and location) are similar in both comparisons; therefore, only the SV detection results from the Evry.Ler-1 assembly and optical maps against the Col-0 TAIR10.1 reference genome are presented in detail. Descriptions of the SVs detected by comparing the Evry.Col-0 SDN assembly and optical maps with the Ler reference genome are provided in Additional file 1: Tables S10-S14 and Additional file 2: Fig. S5A-E.

The sequence comparison of Evry.Ler-1 assembly to Col-0 TAIR10.1 reference genome using MUMmer show-diff utility [61] revealed 2186 potential SVs. A total of 119 SVs, called reference sequence junction (SEQ), break (BRK) and jump (JMP), found in centromeric, telomeric and nearby rDNA clusters, were considered to correspond to unresolved assembly regions into Evry.Ler-1 assembly compared to Col-0 TAIR10.1 reference genome and were filtered out (Additional file 1: Table S15).

The estimated ONT sequencing error rate was 4.0 and 4.9% for the trimmed, corrected Evry.Col-0 and Evry.Ler-1 sequences, respectively. Even though these error rates are lower than previously described [62], a size filter on query ONT structural variations (> 1 kb, the SV detection size limit for high-quality Bionano calls) was applied to avoid false positive SV detection and to remain comparable with Bionano technology. Of the 1184 SVs > 1 kb (54.2%), 591 insertions (INS), 581 deletions (DEL) and 12 inversions (INV) were detected, but no duplications (Table 1 and Fig. 2A).
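The size filter and per-type tally can be illustrated with a short sketch. It assumes each SV is represented as a dictionary with hypothetical `type` and `size` keys (size in bp); the real input would be parsed from the MUMmer show-diff report, whose parsing is not shown here.

```python
from collections import Counter

MIN_SIZE = 1_000  # bp; SVs must be strictly larger than 1 kb


def filter_and_count(svs, min_size=MIN_SIZE):
    """Keep SVs larger than min_size and count the survivors by type."""
    kept = [sv for sv in svs if sv["size"] > min_size]
    return kept, Counter(sv["type"] for sv in kept)


# Toy records (hypothetical layout, not the study data).
toy = [
    {"type": "INS", "size": 1500},
    {"type": "DEL", "size": 800},   # removed: below the threshold
    {"type": "DEL", "size": 3455},
    {"type": "INV", "size": 1000},  # removed: not strictly above 1 kb
]
kept, counts = filter_and_count(toy)
```

As a consistency check on the figures in the text, the three retained categories add up to the reported total: 591 + 581 + 12 = 1184.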

Table 1 Characteristics of Evry.Ler-1 ONT and Bionano SVs, obtained after alignment against Col-0 TAIR10.1 reference genome
Fig. 2

Number of Evry.Ler-1 structural variations detected by ONT and Bionano against the Col-0 TAIR10.1 reference genome and overlaps in locations between the two technologies. The bars and circles colored in orange and purple correspond respectively to the ONT and Bionano technologies. A Barplot of SV number for insertions (INS), deletions (DEL), inversions (INV), translocations (TRA) and all SVs (TOTAL) obtained using ONT and Bionano technologies. B Venn diagram of common and specific locations detected by ONT and Bionano technologies

A 5 Mb insertion in the Evry.Ler-1 assembly was detected on Chr3 of the Col-0 TAIR10.1 reference genome (14,272,986..14284724) due to a MUMmer detection error in a complex region associated with a rDNA cluster. This insertion was therefore removed from the final data and not considered in the results. The median size of the Evry.Ler-1 ONT structural variations was 3455 bp and their cumulated size was 7.7 Mb. The SVs were equally distributed in size and number between INS and DEL. The INV category had higher median and average sizes than INS and DEL. With a cumulated size of 0.3 Mb, INV represented 3.9% of the ONT variation size (Table 1). Structural variations were detected on all chromosomes, with a preferential location on chromosome arms and with no confident SV on the Chr1, 3 and 4 centromeres (Fig. 1).

Optical map construction and SV detection based on physical map comparison were carried out with the Bionano Solve™ interface (Bionano Genomics, version 3.3). A total of 797 SVs were highlighted by comparing the Evry.Ler-1 optical maps to the in silico DLE-1 labeling of the Col-0 TAIR10.1 reference genome (Additional file 1: Table S15). When Bionano Solve tools detected one SV embedded in a second one, the largest SV was kept. This case was found at two independent Chr1 locations (INS:19432310..19468513 and DEL:24688666..24736849). A 1 kb size filter was applied on the Bionano SVs, which was equivalent to removing deletions and insertions with a Bionano quality score < 10 (defined as poor quality by the manufacturer) (Additional file 1: Table S16). Additionally, on Chr2, the INV SV (3,433,371..3490731) with no quality score was discarded. Thus, 591 SVs representing 74.2% of total optical map Evry.Ler-1 SVs were further considered in this analysis. INS and DEL types constituted the main part of the optical map Evry.Ler-1 SVs (48.9 and 49.9% of the SVs respectively), the remaining 1.2% corresponding to translocations (TRA) and INV (Table 1 and Fig. 2A). The median SV size was 4383 bp and the cumulated SV size represented 7.2 Mb of the genome. The TRA and INV types corresponded to nearly one third (2.0 Mb) of the cumulated size of the structural variations. In our study, the translocation type was only detected using the Bionano assembly (Table 1 and Fig. 2A). The two Evry.Ler-1 TRA were located on Chr2 (3,378,844..3397121; 3,484,209..3844839) (Fig. 3A and Additional file 2: Fig. S3B). The largest SV identified was a 1.1 Mb Evry.Ler-1 INV located on Chr4 of the Col-0 TAIR10.1 reference genome (1,435,832..2593360) (Fig. 3B and Additional file 2: Fig. S3D). SVs were distributed preferentially along the chromosome arms and their detection was limited in centromeric regions due to the decrease in labeling in these regions (Fig. 1).

Fig. 3

Focus on large structural variations (MU) located in complex regions. For each location, optical maps are colored in green for the Col-0 TAIR10.1 reference maps (ChrM for mitochondrial chromosome map) and light blue for Evry.Ler-1 maps. Consistent DLE-1 enzyme labels between reference and Evry.Ler-1 maps are represented by dark blue bars with grey links between the genome maps. Inconsistent DLE-1 enzyme labels are shown by yellow bars on the two genome maps. The purple bars represent the insertion events on the Evry.Ler-1 maps / reads / assembly, and the black, yellow and pink bars correspond to deletions, inversions and translocations respectively. Araport11 annotation of the Col-0 TAIR10.1 reference (Genes and TE) and IGV view of the Minimap2 alignments of the Evry.Ler-1 trimmed ONT reads and SMARTdenovo contigs are also presented. A Bionano Chr2 Evry.Ler-1 translocations against Col-0 TAIR10.1 reference genome (MU_153). B Bionano Chr4 Evry.Ler-1 extra-range size inversion against Col-0 TAIR10.1 reference genome (MU_097). C Bionano Chr4 Evry.Ler-1 large deletion against Col-0 TAIR10.1 reference genome (MU_102). D Bionano Chr5 Evry.Ler-1 inversion against Col-0 TAIR10.1 reference genome (MU_138)

SVs comparison

SVs comparison was based on their absolute start- and end-positions on the Col-0 TAIR10.1 reference genome. We considered that structural variations locations were comparable in both technologies when their locations on Col-0 TAIR10.1 reference genome overlapped by at least 1 bp.
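The overlap criterion can be written down in a few lines. The sketch below assumes inclusive start and end coordinates on the reference and represents each SV as a hypothetical `(chromosome, start, end)` tuple; the function names are illustrative only.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of reference bases shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def same_location(a, b, min_overlap=1):
    """True when two SVs lie on the same chromosome and share at least
    min_overlap bases. Each SV is a (chrom, start, end) tuple."""
    return a[0] == b[0] and overlap_bp(a[1], a[2], b[1], b[2]) >= min_overlap


# Two intervals that touch at a single base count as a common location.
touching = same_location(("Chr1", 100, 200), ("Chr1", 200, 300))
# Adjacent but non-overlapping intervals do not.
adjacent = same_location(("Chr1", 100, 200), ("Chr1", 201, 300))
```

A 1 bp threshold is deliberately permissive: it groups calls whose breakpoints differ between technologies, at the cost of occasionally pairing unrelated neighbouring events.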

SV comparison metrics are presented in Table 2 and the numbers of overlapping locations in Fig. 2B. A total of 563 common locations were identified, representing 948 (80.1%) of the Evry.Ler-1 ONT SVs and 563 (95.3%) of the Evry.Ler-1 optical map SVs. The cumulated sizes of these common SVs were respectively 5.9 Mb and 6.9 Mb for ONT and Bionano detection, representing 4.5% of the size of the Col-0 TAIR10.1 reference genome (based on 130 Mb) for ONT and 5.3% for Bionano. ONT SVs tended to be smaller than Bionano SVs (Table 2, Additional file 1: Tables S17 and S18).

Table 2 Characteristics of Evry.Ler-1 ONT and Evry.Ler-1 optical map SVs identified in common and specific Col-0 TAIR10.1 reference locations

To compare the median sizes of the ONT and Bionano variations (> 1 kb), we made notched boxplots with and without the large events (> 50 kb) (Fig. 4). Using the oriented Wilcoxon rank-sum test as performed by Dixon et al. (2018), all p-values were below the significance level alpha = 0.05; the median sizes of the ONT SVs are therefore significantly smaller than those of the Bionano SVs. In addition, the median sizes of all insertions and of all deletions detected using the Bionano technology were respectively 30.5 and 24.6% larger than with ONT. This last point is related to the fact that we applied a filter for ONT SVs (> 1 kb), thus increasing the median sizes for all categories.
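To make the one-sided ("oriented") rank-sum comparison concrete, here is a minimal pure-Python sketch. It is not the authors' analysis code, and the software they used is not stated in this section. The function name `rank_sum_less` is hypothetical; the p-value relies on the normal approximation with a tie correction and no continuity correction, which is an assumption suited to large samples such as those compared here. In practice a statistics library would normally be used instead.

```python
import math


def rank_sum_less(x, y):
    """One-sided Wilcoxon rank-sum (Mann-Whitney U) test.

    Alternative hypothesis: values in x tend to be smaller than values in y.
    Returns (U statistic of x, approximate p-value).
    Limitation: fails if every pooled value is identical (zero variance).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n, m = len(x), len(y)
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        # Find the block of tied values starting at index i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n * (n + 1) / 2
    total = n + m
    mean_u = n * m / 2
    var_u = n * m / 12 * ((total + 1) - tie_term / (total * (total - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    return u_x, p_value


# Toy sizes (bp): the first sample is uniformly smaller than the second.
small = [1100, 1500, 2000, 2600, 3400]
large = [4300, 5200, 6100, 8000, 9500]
u_stat, p = rank_sum_less(small, large)
```

A small p-value supports the alternative that the first sample tends to be smaller; swapping the two samples gives a p-value close to 1.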

Fig. 4

Size distribution and median comparison of ONT and Bionano SV. All p-values were obtained with a two-sided oriented Wilcoxon rank-sum test. Hypothesis H0 was “median of ONT SV size is less than Bionano one”. ONT and Bionano boxplots are colored in orange and purple respectively. Medians are represented by red dots. A Boxplot of ONT (n=1184) and Bionano (n=591) SV>1kb. B Boxplot of ONT (n=1169) and Bionano (n=573) 1kb<SV<50kb. C Boxplot of ONT (n=591) and Bionano (n=289) INS>1kb. D Boxplot of ONT (n=588) and Bionano (n=282) 1kb<INS<50kb. E Boxplot of ONT (n=581) and Bionano (n=295) DEL>1kb. F Boxplot of ONT (n=571) and Bionano (n=288) 1kb<DEL<50kb

To go further, SVs identified by the ONT and Bionano technologies were assigned a two-letter svID code. The first letter refers to the ONT SVs and the second to the Bionano SVs, leading to common (svID UU and MU) and specific (svID UN and NU) locations (with “U” for “Unique location”, “M” for “Multiple locations” and “N” for “No location”, Additional file 1: Tables S17 and S18).
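One way to read this coding scheme is sketched below. The grouping logic is an interpretation, not the authors' implementation: each Bionano SV is labeled by the number of ONT SVs overlapping it (none: NU, one: UU, several: MU), and ONT SVs overlapping no Bionano SV are labeled UN. SVs are hypothetical `(chromosome, start, end)` tuples with inclusive coordinates, and an ONT SV spanning several Bionano SVs is not treated specially.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) SVs share at least 1 bp."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def assign_svid(ont_svs, bionano_svs):
    """Group SV indices into UU, MU, NU and UN categories."""
    labels = {"UU": [], "MU": [], "NU": [], "UN": []}
    matched_ont = set()
    for b_idx, b_sv in enumerate(bionano_svs):
        hits = [i for i, o_sv in enumerate(ont_svs) if overlaps(o_sv, b_sv)]
        matched_ont.update(hits)
        if not hits:
            labels["NU"].append(b_idx)            # Bionano only
        elif len(hits) == 1:
            labels["UU"].append((hits[0], b_idx))  # one ONT SV, one Bionano SV
        else:
            labels["MU"].append((hits, b_idx))     # several ONT SVs, one Bionano SV
    # ONT SVs that matched no Bionano SV.
    labels["UN"] = [i for i in range(len(ont_svs)) if i not in matched_ont]
    return labels


# Toy coordinates, not taken from the study.
ont = [("Chr1", 100, 200), ("Chr1", 1000, 1100), ("Chr1", 1200, 1300), ("Chr2", 50, 60)]
bionano = [("Chr1", 150, 250), ("Chr1", 900, 1400), ("Chr3", 1, 10)]
labels = assign_svid(ont, bionano)
```

In the toy example the first Bionano SV pairs with a single ONT SV (UU), the second spans two ONT SVs (MU), the third has no ONT counterpart (NU), and the Chr2 ONT SV is ONT-specific (UN). The nested loop is quadratic, which is acceptable for a few thousand SVs; an interval tree would be preferable for much larger call sets.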

Among the 563 common regions, 410 (72.8%) coincided with svID UU, i.e. one ONT structural variation corresponding to one Bionano SV. For 364 (88.8%) of the svID UUs the overlap of these locations was 100% and for 30 (7.3%) it was greater than 50%. Only 16 (3.9%) svID UUs had less than 50% overlap (Additional file 1: Table S17). Moreover, 405 (98.8%) of the svID UU SVs had a “conforming” type (i.e. the same type in both technologies) (Additional file 1: Table S17), and five svID UU (1.2%) were identified as deletions by ONT and as insertions by Bionano (svID UU_035, UU_038, UU_057, UU_073, UU_358; Additional file 1: Tables S17 and S18).

The remaining 153 (27.2%) common locations corresponded to 538 Evry.Ler-1 ONT SVs (56.8% of common ONT SVs) related to 153 Evry.Ler-1 optical map SVs (27.2% of common Bionano SVs) (Table 3).

Table 3 Characteristics of the svID MU identified in ONT and Bionano SVs

These svID MU structural variations ranged from 2 (representing 59.5%) to 22 ONT SVs for one Bionano SV. The cumulative size of this SV category was approximately 4 Mb for both technologies, although the number of ONT variants was 3.5 times higher than for Bionano (538 vs 153). The sizes of these SVs ranged from 1 kb (due to the filter applied) to 87 kb for ONT and to 1.1 Mb for Bionano. Furthermore, Bionano median and average sizes were 2- and 4-fold larger, respectively. Unlike the svID UU, the type of the svID MU was “conforming” for only 68 (44.5%) locations, of which 58 (85.3%) corresponded to 2 ONT SVs for 1 Bionano SV. The remaining 10 (14.7%) locations comprised 3 or 4 ONT SVs for one Bionano SV.

The largest ONT SV was included in a complex SV (svID MU_102) consisting of four contiguous deletions located on Chr4. These four deletions coincided with one Evry.Ler-1 optical map deletion (Fig. 3C, Additional file 1: Tables S17 and S18). The largest Evry.Ler-1 optical map SV (svID MU_097) was an inversion of 1,143,224 bp on Chr4 overlapping 22 Evry.Ler-1 ONT SVs (corresponding to INS and DEL) (Fig. 3B, Additional file 1: Tables S17 and S18). Among the other large (> 50 kb) Bionano SVs, svID MU_138, an INV of 207 kb, was located on Chr5 and covered 5 ONT structural variations (4 INS and 1 DEL) (Fig. 3D, Additional file 1: Tables S17 and S18).

Specific locations were more abundant with the ONT technology (236 SVs - svID UN, SV detected with ONT only - 19.9%) than with Bionano (28 SVs - svID NU, SV detected with Bionano only - 4.7%), leading to a cumulated size of 1.8 Mb and 0.3 Mb respectively, and with a median size about twice as large for ONT (2656 bp for Evry.Ler-1 ONT SVs vs 1374 bp for Evry.Ler-1 optical map SVs). The specific Evry.Ler-1 ONT SVs showed a clear tendency to locate in the NOR and centromeres of the Col-0 TAIR10.1 reference chromosomes (Fig. 1). The largest specific ONT variant was located on Chr3 and corresponded to a DEL (svID UN_124, Additional file 1: Table S17). The largest specific Bionano SV was found on Chr3 and corresponded to an INV (svID NU_017, Additional file 1: Table S18, Additional file 2: Fig. S3C). A focus on the TRA located on Chr2 revealed an 18.2 kb specific Evry.Ler-1 optical map SV (svID NU_007), close to the second TRA of 360 kb (MU_153). This last TRA (MU_153) coincided with seven SV events (1 INV, 5 INS and 1 DEL) in the Evry.Ler-1 assembly (Fig. 3A, Additional file 1: Tables S17 and S18).

Using the Araport11 annotation of the Col-0 TAIR10.1 reference genome (The Arabidopsis Information Resource – TAIR), a comparison using only Evry.Ler-1 ONT SVs is shown in Table 4. Since the Evry.Ler-1 optical map events represented a large-scale observation, they were not taken into account in this analysis. A total of 893 (75.4%) out of 1184 Evry.Ler-1 ONT SVs overlapped TE features, of which 579 also overlapped genes. Only 291 (24.6%) Evry.Ler-1 ONT SVs were located outside a TE feature, overlapping genes [125 (10.6%)] or not [166 (14.0%)] (Table 4). Focusing on the Evry.Ler-1 ONT-specific SVs, their overlap with the Col-0 TAIR10.1 reference annotation showed percentages similar to those of the common SVs.

Table 4 Evry.Ler-1 ONT SVs (> 1 kb) overlapping Col-0 TAIR10.1 reference genes and TEs annotation features

To better characterize the genes affected by ONT SVs in common locations, a GO-terms overrepresentation test was performed with the PANTHER’s tool [63] available on TAIR website (https://www.arabidopsis.org/tools/go_term_enrichment.jsp). Among the 1764 genes identified in common locations, 47.2% (832) genes were uniquely assigned to a GO term and used in PANTHER (Additional file 1: Tables S19 and S20). Overrepresentations in defense response and ADP-binding terms were detected (Additional file 1: Table S21), but no enrichment for GO-terms in genes in specific ONT locations was highlighted (Additional file 1: Tables S22-S24).

A summary of the main comparison criteria between the two technologies is presented in Table 5. It appears that the ONT and Bionano technologies (with DLS labeling) were equally effective in detecting SVs of less than 50 kb and those in gene regions. In our study, Bionano was more efficient on large events, while additional analyses are necessary to detect these variations with ONT.

Table 5 Summary of comparisons of ONT assembly and Bionano Optical mapping (DLS labeling) for SV detection

Discussion

Herein, we compare the performance of Oxford Nanopore and Bionano Genomics technologies for structural variation detection. For this, we performed long read sequencing and optical mapping of two A. thaliana ecotypes, namely Columbia-0 (Col-0) and Landsberg erecta 1 (Ler-1). Long read de novo assemblies were constructed using three different assemblers and optical maps were assembled with Bionano Solve tools. Structural variations detected using the Col-0 TAIR10.1 [56] and Ler [38] genomic sequences as references, were described and compared to each other, to reveal the relative strengths of the two technologies in detecting SVs.

Assemblies based on ONT and Bionano data for SV analyses

To obtain the best assembly based on only long reads data we used three different assemblers. After comparison of assembly metrics, calculation time and collinearity against reference genomes, SDN provided the best assembly even if some collinearity breaks were observed, especially in centromeric regions. The metrics of Evry.Col-0 and Evry.Ler-1 SDN assemblies were comparable to such assemblies in previous studies [24, 38, 39, 64] but remained underestimated.

Continuous improvement in protocols and new developments in genome assembly strategies and algorithms resulted in a higher quality of genomic sequences used in subsequent analyses. Previously published Bionano A. thaliana optical map (KBS-Mac-74 genome [39]) used a BspQI staining protocol for labeling, generating about 10 times more maps to cover the entire genome of the KBS-Mac-74 ecotype than in our study (DLE-1 Bionano staining protocol), highlighting enhancement in Bionano’s protocol. In addition, no optical map was previously available for the Columbia (Col-0) and Landsberg erecta 1 (Ler-1), making our map assemblies especially valuable for further studies.

Our high quality optical map allowed us to define centromeric and nucleolar organizer regions (NOR), despite lower molecule density and even though losses of label concordance were observed between the Evry.Ler-1 maps and the Col-0 TAIR10.1 in silico reference maps. Moreover, fluctuations in ONT coverage density and accumulation of repetitive alignments in the same regions provide further evidence of the approximate locations of the centromeres and NOR. However, we identified several misassemblies in the course of our SV analyses between the Evry.Ler-1 SDN assembly and the Col-0 TAIR10.1 reference genome, highlighting how difficult it can be to obtain a reliable assembly, and thus to detect SVs, in these complex regions.

SV detection and comparison between the two technologies

We compared structural variations in Evry.Ler-1 and the reference genome Col-0 TAIR10.1. We chose this reference because of its high quality and the richness of the associated studies [24, 38, 39].

The cumulated SVs sizes obtained for ONT and Bionano in our study are smaller than in previous studies [24, 38]. Filtering on SVs size (SVs > 1 kb) vs no size filter could explain this difference. In addition, the lack of duplications detection in ONT assembly could depend on MUMmer’s ability to detect this type of SV, reflecting the detection complexity of the duplication events, as mentioned in Goel et al (2019). In contrast, the absence of duplication detected by Bionano could be explained by polymorphic duplications between Evry.Ler-1 maps and Col-0 TAIR10.1 reference genome, which would break the collinearity, as described in Jiao and Schneeberger (2020), and by the size of duplications (< 5 kb, [24, 64]) identified as the limit of Bionano detection.

Analyses by the two technologies revealed a predominance of insertions, deletions and inversions, with larger median and average sizes for Bionano SVs. The distribution of these types of SV is homogeneous along the chromosome arms. Most of the specific ONT SVs are located in the centromeric and pericentromeric regions. Nonetheless, a decreased coverage of the SVs in these regions was observed, in contrast to the previous report by Pucker et al. (2019); this is probably due to technical problems such as assembly errors (for ONT SMARTdenovo) [65]. This reduction in SV coverage is also observed with Bionano technology, which shows a lower labeling density in these complex regions. This contrasts with previous results identifying more SVs in regions where the meiotic recombination rate decreases [24]. The filtering of ONT SVs smaller than 1 kb could again explain this contradiction. On the other hand, Bionano Solve tools correctly identified the translocation previously characterized on Chr2 and three inversions larger than 50 kb present on Chr3 (NU_017), Chr4 (MU_097) and Chr5 (MU_138) [24, 38, 65]. For example, compared to the Col-0 TAIR10.1 reference genome, the Evry.Ler-1 maps support a 360 kb translocation of mitochondrial sequence in Chr2 around the 3.6 Mb Col-0 TAIR10.1 position (svID MU_153). This observation is concordant with Stupar et al. (2001), who first described the mtDNA insertion in the Col-0 TAIR10.1 reference genome [66]. In the same Chr2 region (3.29 Mbp to 3.48 Mbp), Pucker et al. (2019) identified a second 300 kb highly divergent region between A. thaliana Nd-1 and the Col-0 TAIR10.1 reference genome. In the same study, Pucker et al. also described the lack of the entire region between 3.29 Mbp and 3.48 Mbp in the Ler reference genome, corresponding to the specific translocation of 18.2 kb detected in the Evry.Ler-1 map (svID NU_007).
Zooming in on this Col-0 TAIR10.1 Chr2 region (3.2 Mb to 3.5 Mb) in the Evry.Ler-1 SDN assembly, many small contigs are observed, with a missing sequence of 110 kb. This observation explains the absence of SV detection, confirming the great complexity of this region and the sequence divergence between the Ler-1 and Col-0 genomes described by Pucker et al. (2019). Even though the Col-0 reference genome has been improved since 2000, gaps and misassemblies remain, as widely reported [41, 56, 65, 67, 68]. A combination of the best Col-0 TAIR10.1 sequence and the new high quality optical map data obtained in this study will provide valuable resources to re-evaluate the assembly of complex regions.

The svID MU structural variations result either from too low a density of DLE-1 sites, or from a high divergence of genomic regions between the two ecotypes. In both cases, experimental validation is essential to confirm the number and type of SVs. Nevertheless, the fact that the largest events described (MU_097 (Chr4 INV), MU_102 (Chr4 DEL), MU_153 (Chr2 ONT SVs matching the second Bionano translocation) and MU_138 (Chr5 large inversion)) were retrieved in our study serves as a proof of concept for the ONT and Bionano technologies and for the parameters used in the analysis.

Comparing locations of the Evry.Ler-1 ONT SVs with Araport11 annotations, we found that common and specific ONT SVs were preferentially linked to TE features and genes, as reported in Jiao and Schneeberger (2020). Looking at the GO-term enrichment in genes overlapping common ONT SVs, an overrepresentation in defense response and ADP-binding terms corresponding to resistance genes was observed. This result is concordant with previous studies [13, 24, 38, 69,70,71] in which an association between structural variations and the cluster organization of resistance genes was described.

General conclusion

Because analyses of SV locations and their consequences heavily rely on the quality of their identification and the underlying assembly/mapping data, we aimed to compare the performance of ONT and Bionano technologies for structural variation detection. By applying stringent filters to the ONT assembly mapping approach and size filters to SVs, we have shown that this methodology is an easy and efficient way to detect reliable SVs. Most of the detected SVs were also identified with Bionano optical maps with high concordance despite different characteristics (average, size, median). Nevertheless, long read sequencing technologies make it possible to detect SVs more accurately, while Bionano offers a broad overview of structural rearrangements. The choice of technology therefore has to be based on the characteristics of the locations to be studied. If these locations are known to be gene regions without repeated sequences, the analysis of an ONT assembly will be reliable and provide more confidence in the SV locations; Bionano will then be of minimal interest. In contrast, if these regions are identified as being complex (areas rich in transposable elements, for example), the analysis of structural variations from an ONT assembly will be more delicate, since the assembly itself and the alignments used for detection will be less reliable in these locations. ONT analyses of these regions cannot be taken at face value and will require validation (targeted experimentation by labeling, PCR, detection of these SVs by another technology, progeny analysis …). On the other hand, Bionano technology is effective in validating variation in these large complex regions. Combined with Bionano analyses, which provide an overview and point to these areas, ONT analyses and their associated results gain in value. The major limit of Bionano is the lack of access to the sequence information. In addition, whole genome SV analyses are currently mostly limited to model organisms.
However, because Oxford Nanopore long-read and Bionano Genomics optical map assemblies do not require previous knowledge of the genomic architecture or the sequence of the studied organism, this approach expands the range of plant species or species complexes in which in-depth SV analyses can be performed. Unlike in animals, in plants the heterogeneity and size of genomes, polyploidy, heterozygosity and reference sequences that are sometimes very different and potentially of low quality make population analyses difficult if not impossible. Therefore, population analysis using Bionano is only possible when the reference is of very high quality and genomically very close to the other ecotypes. On the other hand, these plant characteristics have less impact on the detection of variations by ONT, which is much more local with this technology.

ONT appears to be especially suitable to carry out plant population analyses and Bionano more relevant to study plasticity of genome structure, leading to an obvious complementarity of these two technologies in SVs analysis.

Methods

Plants

Arabidopsis thaliana Columbia-0 (accession number 186AV) and Landsberg erecta-1 (accession number 213AV) seeds were provided by the Versailles Arabidopsis Stock Center (National Research Institute for Agriculture, Food and Environment, Versailles, France, http://publiclines.versailles.inra.fr/). They were sown directly in soil and transplanted after 10 days. Plantlets were grown under a 16 h light/8 h night photoperiod in a growth chamber at 20 °C for 4–5 weeks. Before harvest, the plants were dark-treated for 3 days.

Oxford Nanopore sequencing (MinION) HMW DNA extraction

High Molecular Weight (HMW) DNA extraction was performed using a modified salting-out protocol. A total of 5 g of freshly harvested leaves was ground in liquid nitrogen with a mortar and pestle and transferred to a 50 ml tube containing 10 ml of extraction buffer prewarmed to 50 °C (1.25% SDS, 100 mM Tris-HCl pH 8, 50 mM EDTA, 0.01% w/v PVP40). Then 37.5 μl of beta-mercaptoethanol (0.375% final) and 10 μl of RNAse A (Qiagen®, 100 mg/mL) were added. This solution was incubated for 30 min at 50 °C with agitation (10 s at 300 rpm every 10 min). After incubation, 20 ml of TE (10:1) was added and the solution was slowly homogenized, followed by the addition of 10 ml of 5 M KAc. The tube was kept on ice for 5 min, then centrifuged at 4 °C for 10 min at 500 g. The solution was transferred to two 15 ml tubes and centrifuged again under the same conditions. The supernatant was transferred to a 50 ml tube containing 1 volume of isopropanol, slowly inverted 10 times, then centrifuged at 4 °C for 10 min at 5000 g. Pellets were washed with 20 ml of 70% ethanol, then centrifuged at 4 °C for 5 min at 5000 g. The supernatant was removed and the pellets were left incompletely dried before solubilization in 100 μl of TE (10:1) prewarmed to 50 °C. The DNA solution was then incubated at 50 °C for 10 min. Field Inverted Gel Electrophoresis (program 50–150 kb on a Pippin Pulse from Sage Science) was used for DNA size estimation, and DNA samples with molecule sizes above 50 kb were kept. DNA purity was evaluated by spectrophotometry (OD260/280 and OD260/230 ratios).

Bionano optical maps ultra HMW DNA extraction

We performed the DNA extraction using the Base protocol n°30068 vD (Bionano Genomics) with minor adaptations. Three grams of very young fresh leaves from each genotype were harvested from the dark-treated rosettes. The samples were placed on aluminium foil on ice, then transferred to a 50 ml tube fitted with a screened cap allowing pouring without loss of sample (Bio-Rad). The tubes were kept on ice during nuclei isolation. Samples were treated in a fixing solution containing 2% formaldehyde under a fume hood, then rinsed with fixing solution without formaldehyde. Fixed leaves were transferred to a square Petri dish with 4 ml of Plant Homogenization Buffer plus (HB+, i.e. HB supplemented with 1 mM spermine tetrahydrochloride, 1 mM spermidine trihydrochloride and 0.2% 2-mercaptoethanol). Entire leaves were chopped with a razor blade into 2 × 2 mm pieces, then transferred to a new tube on ice, and 7.5 ml of HB+ was added. Using a TissueRuptor (Qiagen), the 2 × 2 mm pieces were blended for a total of four cycles (20 s at maximum speed followed by 30 s of rest). Plant homogenates were filtered, first through a 100 μm and then through a 40 μm cell strainer, and volumes were adjusted to 45 ml. Nuclei were centrifuged at 3840 g at 4 °C for 20 min and supernatants were discarded. Nuclei were gently re-suspended in the residual buffer, 3 ml of HB+ was added, then tubes were swirled on ice and the volumes were adjusted to 35 ml. Homogenates were centrifuged at 60 g at 4 °C for 3 min using minimum deceleration. Solutions were very carefully transferred to a new tube to avoid carry-over of debris and filtered again through a 40 μm cell strainer. Nuclei were centrifuged at 3840 g at 4 °C for 20 min, 3 ml of HB+ was added and tubes were swirled on ice. Following the Bionano Nuclei Purification by Density Gradient procedure, the nuclei homogenate was layered on top of two solutions of different densities. After centrifugation at 4500 g at 4 °C for 40 min, the nuclei were at the interface of the two solutions.
They were recovered with a wide-bore tip in about 1 ml of solution, transferred to a 15 ml tube and adjusted to 14 ml with HB+. Nuclei were centrifuged at 2500 g at 4 °C for 15 min. All the buffer was removed and nuclei were re-suspended in 60 μl of HB+.

The nuclei solution was adjusted to 43 °C for 3 min, and melted 2% agarose from CHEF Genomic DNA Plug Kits (Bio-Rad) was added to reach a 0.82% agarose plug concentration. Plugs were cooled on aluminium blocks refrigerated on ice. Purification of the plugs was performed with Bionano Lysis Buffer adjusted to pH 9 and supplemented with proteinase K and 0.4% 2-mercaptoethanol. Plugs were digested for 2 h at 50 °C in a Thermomixer, then the solution was refreshed and incubated again overnight. Plugs were treated with RNAse for 1 h at 37 °C in the remaining solution. Plugs were washed three times in Wash Buffer (Bionano Genomics) then four times in TE 10:1. DNA retrieval was performed as recommended by Bionano Genomics, as follows: plugs were melted at 70 °C for 2 min, then transferred immediately to 43 °C and incubated for 45 min with 2 μl of Agarase (0.5 unit/μl). The melted plugs were recovered with wide-bore tips and dialyzed on a 0.1 μm membrane disk (Millipore) floating on 10 ml of TE for 1 h. DNA was quantified in triplicate with Qubit according to the Bionano protocol. Two methods were used to estimate the size of DNA molecules: Pippin Pulse and the Qcard Argus System (Opgen), which allows DNA combing on a lane and visualization of molecules under a fluorescence microscope after staining. Samples with molecules above 150 kb were kept for labeling. Protocols were performed according to Bionano Genomics with 600 ng of DNA for both Col-0 and Ler-1 ecotypes. The direct label and stain (DLS) labeling consisted of a single enzymatic labeling reaction with the DLE-1 enzyme followed by DNA staining with a fluorescent marker. It was performed with 750 ng of DNA. Chip loading was performed as recommended by Bionano Genomics.

ONT sequencing (MinION) and assembly

ONT libraries were prepared according to the following protocol, using the Oxford Nanopore SQK-LSK109 kit. Genomic DNA, or DNA previously fragmented to 50 kb with a Megaruptor (Diagenode S.A., Liege, Belgium), was first size-selected using a BluePippin (Sage Science, Beverly, MA, USA). The selected DNA fragments were end-repaired and 3′-adenylated with the NEBNext® Ultra™ II End Repair/dA-Tailing Module (New England Biolabs, Ipswich, MA, USA). The DNA was then purified with AMPure XP beads (Beckman Coulter, Brea, CA, USA) and ligated with sequencing adapters provided by Oxford Nanopore Technologies (Oxford Nanopore Technologies Ltd., Oxford, UK) using Blunt/TA Ligase Master Mix (NEB). After purification with AMPure XP beads, the library was mixed with Running Buffer with Fuel Mix (ONT) and Library Loading Beads (ONT) and loaded on 4 MinION R9.4 SpotON Flow Cells per Arabidopsis thaliana ecotype. The resulting FAST5 files were base-called using albacore (versions 2.1.10 and 2.3.1) and FASTA files were produced as described in Istace et al. (2017). Canu version 1.5 (github commit ae9eecc) was used for initial read correction and trimming with the parameters minMemory = 100G and corOutCoverage = 10000. The corrected sequences were merged into one final FASTA file per ecotype, which was later used as input for the assemblers.

Assemblies were performed with the relevant genome size parameter set to, or coverage calculation based on, a 130 Mb genome size. Assemblers used with default parameters were Canu version 1.5 ([54], github commit 69b5f32), Rapid Assembler (RA, [59], https://github.com/lbcb-sci/ra commit 07364a1) and SMARTdenovo version 1.0 (with the option -c 1 to run the consensus step) ([60], https://github.com/ruanjue/smartdenovo commit 61cf13d). The MUMmer suite version 3.0 [61] was run with the parameters used in Zapata et al. 2016 [38]. To analyze the assemblies, they were aligned to the reference genome of Arabidopsis thaliana using nucmer with the options -c 100 -b 500 -l 50 -g 100 -L 50. The TAIR10.1 reference genome for A. thaliana Columbia 0 (Col-0, GCF_000001735.4) was chosen as it is the available sequence with the latest annotation. As Pucker et al. (2019) highlighted, the nuclear sequence is the same as in the TAIR9 reference genome, but chloroplast and mitochondrial sequences were added; these were necessary to detect translocations with Bionano technology. The reference genome of Arabidopsis thaliana Landsberg erecta was the one published by Zapata et al. in 2016 (Ler, Genbank LUHQ00000000.1, [38]). The alignments were filtered with delta-filter (options -1 -l 10000 -i 0.95) and visualized with mummerplot (options --fat --large --layout --png) or DNAnexus dot (github commit 78e3317). These MUMmer parameters [38] retained exact matches larger than 50 bp and alignments longer than 10 kb with a minimal identity of 95%. To check the completeness and fragmentation of the assemblies, they were compared to each other based on metrics (number of contigs, N50, cumulative genome size) and on the genome alignments to the references generated with MUMmer and viewed with DNAnexus dot (https://dnanexus.github.io/dot/).
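The assembly metrics named above (number of contigs, N50, cumulative size) can be illustrated with a minimal Python sketch. It is not the script used in this study; the function name and the toy contig lengths are hypothetical, and N50 is taken here in its usual sense (the length of the contig at which the running sum of lengths, sorted from longest to shortest, reaches half of the total).

```python
def assembly_metrics(contig_lengths):
    """Return contig count, cumulative size (bp) and N50 for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50 is the first contig length at which half of the total is reached
        if running * 2 >= total:
            n50 = length
            break
    return {"contigs": len(lengths), "total_bp": total, "n50": n50}


# Toy example with hypothetical contig lengths (not data from this study)
print(assembly_metrics([50, 40, 30, 20, 10]))
```

A higher N50 for a similar cumulative size indicates a less fragmented assembly, which is the sense in which the metrics are compared between assemblers.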

To evaluate the completeness of our ONT data, the corrected ONT reads were mapped to the Col-0 TAIR10.1 reference genome with the Minimap2/2.15 aligner [55] using the -a -x map-ont parameters. The Samtools/1.6 depth tool with the -a option [57] gave the alignment depth at each Col-0 TAIR10.1 reference position. The sequencing error rate was inferred from the percent identity obtained by aligning the trimmed, corrected Evry.Col-0 and Evry.Ler-1 ONT reads to the Col-0 TAIR10.1 and Ler reference genomes, respectively.
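As an illustration of how per-position depth can be condensed into a completeness summary, the following Python sketch parses records in the three-column, tab-separated layout written by samtools depth (reference name, position, depth). The summary statistics chosen here (mean depth and fraction of positions with at least one read) are an assumption made for the example; the text does not state which statistics were derived from the depth output.

```python
def summarize_depth(lines):
    """Summarize depth records formatted as 'chrom<TAB>pos<TAB>depth'."""
    positions = 0
    covered = 0
    depth_sum = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth = int(fields[2])
        positions += 1
        depth_sum += depth
        if depth > 0:
            covered += 1
    if positions == 0:
        return {"positions": 0, "mean_depth": 0.0, "covered_fraction": 0.0}
    return {
        "positions": positions,
        "mean_depth": depth_sum / positions,
        "covered_fraction": covered / positions,
    }


# Hypothetical records, not data from this study
example = ["Chr1\t1\t12", "Chr1\t2\t0", "Chr1\t3\t18", "Chr1\t4\t30"]
print(summarize_depth(example))
```

Because the -a option reports every reference position, including those with zero depth, the covered fraction can be read directly from such a file.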

Bionano optical map assembly

Because it can benefit the assembly steps, molecule sub-sampling was conducted when flowcells yielded more than 90 Gb and 600X of data. This adapted selection of molecules was made on each run either with the Bionano RefAligner tool on the command line (version 1.3.8041.8044 with the -minlen 180 -randomize 1 -subset 1 nb_molec options) or with Bionano Access (version Solve3.3 with the Filter Molecule Object utility) (Additional file 1: Tables S6 and S7).

Maps were then constructed with the Generate de novo Assembly tool of Bionano Solve™ (Bionano Genomics, version 3.3) using the options recommended by Bionano (with pre-assembly, non-haplotype without extend and split) and a 0.115 Gb genome size. The pre-assembly step calculates noise parameters that optimize the quality of the assembly (fewer and larger maps). When a reference FASTA file is provided, noise parameters are calculated by aligning the molecules to the reference. Otherwise, the noise parameters are estimated from a first rough assembly of the molecules. For each of the Col-0 and Ler-1 ecotypes, three maps were obtained: one without a reference, one with the Col-0 TAIR10.1 reference genome and one with the Ler reference genome (Additional file 1: Tables S8 and S9). In our study, the metrics of these assemblies were very similar. This stability indicates that the noise parameters estimated either from the reference FASTA sequences or from our data were comparable, which supports the quality of the Bionano data and assemblies.

ONT variation detection

Structural variations were obtained with MUMmer’s show-diff utility on the filtered alignments of the SMARTdenovo assemblies against the Col-0 TAIR10.1 and Ler reference genomes. One DIFF file per comparison was obtained. The six SV types (Gap, Duplication, Break, Jump, Inversion, Sequence) are described in Additional file 2: Fig. S4.

Bionano variation detection

SV detection was performed on the optical maps built with the public reference and our SMARTdenovo ONT assemblies, using the tool Convert SMAP to VCF file. The resulting VCF files describe all the structural variations between the optical maps and the considered reference. The variations were classified into four types: deletion, insertion, translocation and inversion. SV detection stringency is intrinsic, based on the number of aligned molecules (at least nine by default) and the number of labels across each variant breakpoint on the genome map (at least two by default) (Bionano tutorial: https://bionanogenomics.com/support-page/data-analysis-documentation/). The technology gives an interval reflecting the uncertainty of breakpoint positions (CIPOS and CIEND in VCF files). In this study, these values were used to calculate the most extended positions of the Bionano SVs and avoid the effect of label fuzz.
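One plausible reading of "most extended positions" can be sketched in Python as follows. The exact computation used in this study is not given, so this is an assumption: the start is moved outwards by the lower bound of CIPOS and the end by the upper bound of CIEND, following the standard VCF meaning of these INFO keys (confidence intervals around POS and END). The function names and the example record are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'END=15000;CIPOS=-500,500' into a dict of strings."""
    return dict(item.split("=", 1) for item in info_field.split(";") if "=" in item)


def most_extended_interval(pos, info_field):
    """Return (start, end) widened by the breakpoint confidence intervals.

    Assumption: missing END defaults to POS, and missing CIPOS/CIEND default to 0,0.
    """
    info = parse_info(info_field)
    end = int(info.get("END", pos))
    cipos = [int(x) for x in info.get("CIPOS", "0,0").split(",")]
    ciend = [int(x) for x in info.get("CIEND", "0,0").split(",")]
    start = max(1, pos + min(cipos))  # VCF positions are 1-based, so clamp at 1
    stop = end + max(ciend)
    return start, stop


# Hypothetical deletion record, not taken from the study's VCF files
print(most_extended_interval(10000, "SVTYPE=DEL;END=15000;CIPOS=-500,500;CIEND=-300,300"))
```

Widening the interval in this way trades positional precision for robustness: an SV whose true breakpoint lies anywhere between two neighbouring labels still overlaps the corresponding sequence-based call.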

The low number of structural variations between the Evry.Col-0 optical maps and the Col-0 TAIR10.1 reference genome (and likewise between the Evry.Ler-1 optical maps and the Ler reference genome) reflects the good collinearity between the maps and the references (Additional file 1: Table S25). These SVs indicated the location of conflicts that could be due to mis-assemblies or intra-ecotype variations. Inter-ecotype detection allowed us to describe the variations between Evry.Col-0 and Evry.Ler-1.

Quality and length characteristics were used to better describe and filter SVs. Bionano Solve associates a quality score with each INS and DEL, based on sensitivity and the fraction of alternative calls in mix assemblies that were called in the alternative genome assembly [from no quality (.) or poor (0) to confident quality (20)]. We observed that this indicator follows the same trend as SV size (Additional file 1: Tables S11 and S16). Moreover, SV abundances differed most between the two technologies at the extremes of the size range: for the smallest SVs (< 1 kb), ONT technology detected many more SVs, whereas for the largest (> 5 kb), Bionano technology detected proportionally more SVs. In our comparison analysis, a filter on query SV size (> 1 kb) was applied to remove poor-quality Bionano SVs and to limit the effects of ONT sequencing errors and high sensitivity. Confidence scores for translocation and inversion breakpoints were computed as p-values, giving true confidence (in Mahalanobis distance) to positive calls. The recommended cutoffs are 0.1 and 0.01 for translocation and inversion breakpoint calls, respectively, and were used to eliminate an uncertain inversion on Chr2.

SV description

Custom-made R and Perl scripts were used to edit the outputs of other tools, describe ONT and Bionano SVs (types, size), locate SVs along the chromosomes and filter them. For ONT technology, SVs identified as assembly discordances were briefly described and discarded before comparison. These included the sequence (SEQ), break (BRK) and jump (JMP) ONT SV types, because they correspond to assembly or reference artifacts. Finally, size filters (more than 1 kb) were applied to account for the high ONT sequencing error rate and low-quality Bionano SVs. For Bionano SVs, the most extended absolute positions of each SV were retained, taking into account the uncertainty around breakpoints due to the distance between two labels.
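The two filtering rules described above (discarding SEQ, BRK and JMP features, then keeping SVs larger than 1 kb) can be sketched in Python. The study used custom R and Perl scripts that are not reproduced here; the record layout, the function names and the use of the absolute value of the size (to tolerate negative lengths in the input) are assumptions made for this illustration.

```python
# SV types treated as assembly or reference artifacts in the text
DISCARDED_TYPES = {"SEQ", "BRK", "JMP"}
MIN_SIZE_BP = 1000  # size filter of "more than 1 kb"


def keep_sv(sv_type, size_bp):
    """Return True if an SV passes both the type filter and the size filter."""
    return sv_type not in DISCARDED_TYPES and abs(size_bp) > MIN_SIZE_BP


def filter_svs(svs):
    """Filter a list of SV records given as dicts with 'type' and 'size' keys."""
    return [sv for sv in svs if keep_sv(sv["type"], sv["size"])]


# Hypothetical records, not data from this study
example = [
    {"type": "GAP", "size": 2500},
    {"type": "GAP", "size": 300},
    {"type": "BRK", "size": 12000},
    {"type": "INV", "size": -4000},
]
print(filter_svs(example))
```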

SV comparison

The median sizes of ONT and Bionano SVs were compared with the ggpubr R package (http://www.sthda.com/english/wiki/unpaired-two-samples-wilcoxon-test-in-r). The boxplots were drawn with the ggboxplot tool (my_data, x = "Technology", y = "size", fill = "Technology", palette = c("darkgoldenrod1", "darkorchid"), notch = TRUE, ylab = "Size in base (log10)", xlab = FALSE, ylim = c(0,7), add = "median", add.params = list (size = 0.5,color = "red")) and the Wilcoxon statistical test (H0 = median of ONT SV size is less than Bionano one) was performed with the stat_compare_means tool (method = "wilcox.test", methods.args = list (alternative = "less"), label.y = 6). The test is considered statistically significant when the p-value is less than the selected threshold; we chose the standard threshold of 5%.

The comparison of SVs obtained with ONT and Bionano technologies was based on the overlap of their absolute positions.

ONT and Bionano SV files were converted to BED format and used to identify overlapping regions with BEDtools (version 2.27.1, github commit cd82ed5, “bedtools intersect -wa -wb -a INPUT1.bed -b INPUT2.bed -loj > OUTPUT.bed”). Raw comparisons were then compared, compiled and formatted into one final output file using custom-made R scripts. For each SV location, this file contained descriptors (SV size, type, quality) for both technologies, information on the type of conflict and a two-letter code. This code characterized the SV location as follows: the first letter corresponds to the ONT SV characterization, the second to the Bionano SV. M (“Multiple”) means more than one SV location, U (“Unique”) one SV location, N (“No”) no SV location. For example, the code “MU” means that this location harbored multiple ONT SVs corresponding to a unique Bionano location. No UM location (corresponding to an ONT location overlapping several Bionano SV locations) was detected in our study. The visualization of landscapes and SV occurrences was performed with the Circos/0.69.9 tool (perl/5.16.3 [72]).
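The two-letter code can be illustrated with a simplified Python sketch. It is not the R script used in the study: intervals are hypothetical (chromosome, start, end) tuples in BED-style half-open coordinates, locations are anchored on Bionano SVs, and, in line with the absence of UM locations reported above, each ONT SV is assumed to overlap at most one Bionano SV.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_letter(n):
    """Map a count of SVs to N (none), U (unique) or M (multiple)."""
    return "N" if n == 0 else "U" if n == 1 else "M"


def classify_locations(ont_svs, bionano_svs):
    """Return (code, ONT SVs, Bionano SV) tuples; first letter ONT, second Bionano."""
    locations = []
    for bio in bionano_svs:
        hits = [ont for ont in ont_svs if overlaps(ont, bio)]
        locations.append((count_letter(len(hits)) + "U", hits, bio))
    for ont in ont_svs:
        # ONT SVs without any Bionano counterpart form "UN" locations
        if not any(overlaps(ont, bio) for bio in bionano_svs):
            locations.append(("UN", [ont], None))
    return locations


# Hypothetical intervals, not data from this study
ont = [("Chr1", 1000, 2500), ("Chr1", 3000, 4200), ("Chr2", 500, 2000)]
bionano = [("Chr1", 900, 4500), ("Chr3", 100, 5000)]
for code, ont_hits, bio in classify_locations(ont, bionano):
    print(code, ont_hits, bio)
```

In this toy input, the two ONT SVs on Chr1 fall inside one large Bionano SV (code MU), the Bionano SV on Chr3 has no ONT counterpart (NU), and the ONT SV on Chr2 has no Bionano counterpart (UN).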

SV and annotation

SVs overlapping a gene and/or TE were identified with bedtools intersect by comparing their absolute positions to the A. thaliana Col-0 annotations (July 11th 2019 release, TAIR10_GFF3_genes_transposons.gff). Lists of genes impacted by SVs for both technologies were extracted, and a GO-term enrichment analysis was performed using Fisher’s Exact test with a Bonferroni correction in PANTHER (released 20200407 with GO Ontology database DOI: https://doi.org/10.5281/zenodo.3873405 Released 2020-06-01, [63], http://go.pantherdb.org/). Significance was evaluated based on a P-value ≤ 10^-5 and an FDR value ≤ 0.01 [73].
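The principle of an enrichment test with Bonferroni correction can be sketched in Python. This is not PANTHER's implementation: it assumes a one-sided (over-representation) Fisher's exact test computed from the hypergeometric distribution, and the function names, argument names and toy counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, sample_size, category_size, population_size):
    """One-sided Fisher's exact p-value: P(X >= hits) under the hypergeometric law.

    hits: SV-impacted genes annotated with the GO term
    sample_size: all SV-impacted genes
    category_size: all genes annotated with the GO term
    population_size: all annotated genes
    """
    denominator = comb(population_size, sample_size)
    upper = min(sample_size, category_size)
    numerator = sum(
        comb(category_size, k) * comb(population_size - category_size, sample_size - k)
        for k in range(hits, upper + 1)
    )
    return numerator / denominator


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy counts, not data from this study
p = enrichment_p_value(hits=5, sample_size=5, category_size=5, population_size=10)
print(p, bonferroni(p, n_tests=20))
```

The Bonferroni correction multiplies each raw p-value by the number of GO terms tested, which keeps the probability of at least one false positive across all terms below the chosen threshold.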