Genome assembly scaffolding guided by haplotype information
The scaffolding approach that we propose is based on the presence of shared haplotypes in the genomes of a species. Thus, in principle, it makes use of the effect of linkage disequilibrium (LD) (Lewontin and Kojima 1960). Each individual plant that is compared to a reference genome can show either the reference allele or a non-reference allele at a given genomic position. Since variation is inherited within the species, we expect groups of individuals that show non-reference alleles at the same genomic reference positions. Depending on the size of the genomic region analyzed, we find a number of such groups, each representing one haplotype in this region (Fig. 1). In a mapping population where the parental haplotypes are distinguished by a known set of variant positions (or any set of markers), one can assign each region of the genomes of the following generation to one of the parental haplotypes. In this study, when looking at individuals that are not directly related, we do not know how many “initial” haplotypes should be considered, so we determine haplotypes simply based on shared variation within a given genomic region. In other words, any small genomic region is expected to show two groups of individuals, one reference-like and the other different from the reference. In a larger genomic region, more than two groups may become apparent as more positions are considered. This grouping of individuals based on shared variants results in a variation pattern for each genomic position in the genome. The change from one pattern to another is expected to occur over longer genomic distances (corresponding to expected distances for meiotic crossing-over), so that sequences of a fragmented genome assembly used as reference may show variation patterns that continue across assembly gaps. 
In this way, continuous variation patterns (i.e., variant positions that group the individuals in a similar way) may serve as guidance for ordering and orienting assembled contigs. Positions without any variants and noisy positions need to be removed, and variation patterns need to be summarized in order to reduce the number of patterns to be compared between contigs.
The main steps are: sequencing of a number of individuals, read mapping on the reference assembly, variant calling for each individual compared to the reference, filtering of uninformative or noisy genomic positions, summarizing variation patterns within genomic intervals, and comparison of the summarized variation patterns between contigs.
To process comparable genomic units, the assembled contigs are divided into segments of a fixed size and variation patterns are summarized and counted per segment (Fig. 2). Rare patterns (i.e., occurring only at one or two genomic positions within a segment) are removed, and the remaining summarized variation patterns are referred to as “fingerprint” per segment. The comparison between contigs is based on comparing such variation fingerprints, and contigs are ordered according to their shared fingerprints. Contigs are merged and/or arranged into scaffolds if physical evidence is available, for instance provided by supporting mate-pairs or long reads. Relying solely on physical evidence such as mate-pairs is often ambiguous due to large repetitive regions. These regions will either not be spanned by mate-pair reads or will be spanned by reads that have multiple alignment positions. Hence, haplotype-based information provides additional guidance and confirmation for scaffolding.
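The per-segment summarization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: genotypes are simplified to 0 (reference allele) or 1 (non-reference allele) per individual, and the function name `segment_fingerprints`, the parameter `min_positions`, and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def segment_fingerprints(variants, segment_size=100_000, min_positions=3):
    """Summarize variation patterns per fixed-size segment.

    variants: iterable of (contig, position, genotypes), where genotypes has
    one entry per individual (0 = reference, 1 = non-reference; assumption).
    Returns {(contig, segment_index): {pattern: count}}, keeping only patterns
    observed at min_positions or more positions within the segment.
    """
    counts = defaultdict(Counter)
    for contig, pos, genotypes in variants:
        pattern = tuple(genotypes)
        if len(set(pattern)) < 2:
            continue  # uninformative: all individuals carry the same allele
        counts[(contig, pos // segment_size)][pattern] += 1

    fingerprints = {}
    for segment, patterns in counts.items():
        # Drop rare patterns (seen at only one or two positions by default).
        kept = {p: n for p, n in patterns.items() if n >= min_positions}
        if kept:
            fingerprints[segment] = kept
    return fingerprints
```

With the default `min_positions=3`, a pattern must occur at three or more positions in a segment to enter the fingerprint, which mirrors the removal of patterns seen at only one or two positions.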
Detection of shared variation fingerprints using Arabidopsis data
As a proof of concept, we took advantage of previously generated genotyping data based on whole-genome sequencing of 1135 accessions of the model plant Arabidopsis thaliana (Alonso-Blanco et al. 2016). In the context of that project, single-nucleotide variation had been determined relative to the reference sequence of the A. thaliana Columbia (Col-0) genotype. Based on this variation information we determined variation patterns per position in the Col-0 reference assembly and generated variation fingerprints per segment of 100 kbp. The number of shared fingerprints between such 100 kbp segments was compared for sets of 15 to 100 randomly chosen accessions out of the 1135 available datasets (Table 1). With as few as 15 randomly selected accessions we found a genome-wide average of three shared fingerprints per megabase confirming the assembly (i.e., located in adjacent segments) and up to 0.30 shared fingerprints per megabase between remote segments not supporting the assembly structure (Table 1, Online Resource 1). Most of the segments that were correctly identified as being adjacent based on shared fingerprints were located in centromeric regions. When using 100 randomly selected accessions the number of shared fingerprints in adjacent segments increased to 16 per megabase distributed along the whole genome, with at most 1.90 remote relations per megabase. Using genotyping datasets from 50 randomly selected accessions resulted in 11–13 adjacent shared fingerprints and 1.4–2 remotely shared fingerprints per megabase. Thus, collecting haplotype information based on 50 re-sequenced accessions appeared as a good trade-off between the number of sequenced individuals, the number of fingerprints relating adjacent segments to each other correctly, and the false-positive rate of remote relations (Fig. 3).
Table 1 Subsets of A. thaliana populations used for haplotype-based scaffolding of the A. thaliana Col-0 reference genome
Next, we assessed the impact of the genetic relatedness of the accessions on the detection of adjacent genomic segments based on shared fingerprints. To do so, we selected accessions from two distinct geographic areas instead of the random selection. Fifty Italian accessions showed a density of adjacent shared fingerprints (11.5 per megabase) that was similar to randomly chosen ones. However, we observed more than twice as many (21.9) adjacent shared fingerprints per megabase when choosing 50 accessions from Sweden (Table 1). The set of Swedish accessions comprised samples from all over the country (North and South) whereby the northern accessions displayed the lowest local pairwise genetic distances of any A. thaliana accessions sampled in Europe and also had the highest rate of rare alleles in the context of the whole 1135 accessions dataset (Alonso-Blanco et al. 2016). Southern Swedish accessions had higher pairwise distances compared to northern ones but also had fewer rare alleles. Accessions from northern Sweden were thus the ones that shared the most haplotypes. This was also the dataset with the best performance in terms of density of haplotype-based relations. Italian accessions were chosen based on close geographical proximity within northern Italy but are known to cluster with different populations distributed throughout Eurasia (Alonso-Blanco et al. 2016). The close genetic relatedness among Swedish accessions was reflected in the number of variants: there was less variation between Swedish A. thaliana accessions relative to the Col-0 reference when compared to randomly sampled accessions (Table 1). For example, there were 2.2–3.2 million variant positions in 50 Swedish accessions versus 3.9–4.4 million in 50 randomly sampled accessions. 
This underlines the importance of the presence of conserved haplotypes over larger genomic distances rather than merely a high number of variants to increase the amount of adjacent shared fingerprints.
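The distinction between adjacent and remote shared fingerprints used in this section can be illustrated with a short sketch. It assumes that a relation counts as "adjacent" when both segments lie on the same chromosome and their segment indices differ by exactly one, and as "remote" otherwise; the function name `relation_densities` and the input layout are hypothetical.

```python
def relation_densities(relations, genome_size_bp):
    """Count adjacent and remote segment relations per megabase.

    relations: iterable of ((chrom_a, index_a), (chrom_b, index_b)) pairs of
    segments that share a fingerprint.
    Returns (adjacent_per_mb, remote_per_mb).
    """
    adjacent = remote = 0
    for (chrom_a, idx_a), (chrom_b, idx_b) in relations:
        if chrom_a == chrom_b and abs(idx_a - idx_b) == 1:
            adjacent += 1  # neighbouring segments: consistent with the assembly
        else:
            remote += 1  # distant or inter-chromosomal: not supporting it
    megabases = genome_size_bp / 1_000_000
    return adjacent / megabases, remote / megabases
```

On a known reference such as Col-0, the remote density serves as a rough proxy for the false-positive rate, since the true segment order is available for comparison.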
Quinoa genome assembly
We employed haplotype information for scaffolding of a de novo assembly of the quinoa genome which is about ten times larger than the genome of A. thaliana. In large repeat-rich genomes, contig ordering and scaffolding becomes a challenging task.
As a reference genome, we sequenced the genome of the Bolivian quinoa accession CHEN125, a typical white-seeded cultivar originally collected at Lahuachaca in the Bolivian Altiplano region at an altitude of 3800 m. Sequencing was performed on the Pacific Biosciences RS II sequencing platform, resulting in a dataset with an average read length of 10,464 bp and 65-fold genomic coverage (Table 2) assuming a haploid genome size of 1.45 Gbp (Kolano et al. 2012). The read length distribution had two peaks at 6800 bp and at 16,000 bp (Online Resource 2). The raw data were corrected and assembled using the Canu pipeline (Koren et al. 2017) resulting in an assembly size of 1.32 Gbp consisting of 6747 contigs with an N50 length of 608 kbp (referred to as "Qpac" assembly). Fifty percent of the Qpac assembly was comprised within 546 contigs (Table 3). Of a set of 2121 highly conserved plant genes used as input for BUSCO (Seppey et al. 2019; Simão et al. 2015) we found 2050 genes (96.7%) within Qpac, suggesting comprehensive coverage of the gene space in our assembly (Table 4). Of the 2050 BUSCOs, 184 were found in more than two copies (up to six copies) and 1287 BUSCOs appeared exactly twice. Close to 70% of BUSCO genes were duplicated, which reflects the tetraploid nature of the quinoa genome. Indeed, despite assembling like a diploid genome, quinoa originated from the hybridization of two Chenopodium species. Homeologous sequences were successfully assembled separately, as reflected in the assembly size of 1.32 Gbp which corresponds to twice the haploid size of its parental genome donors. Consequently, most BUSCO groups were detected twice, because they are present in separate homeologous chromosomes.
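For readers unfamiliar with the contiguity metrics quoted above, the following sketch shows how an N50 length and the number of contigs holding half of the assembly are commonly computed from a list of contig lengths. The function name `n50` is illustrative and the code is not taken from the assembly pipeline.

```python
def n50(lengths):
    """Return (N50 length, number of contigs needed to reach half the total).

    Contigs are sorted from longest to shortest and summed until the running
    total reaches 50% of the assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")


# BUSCO fractions as reported in the text (counts taken from the paragraph above).
busco_found = round(100 * 2050 / 2121, 1)
busco_duplicated = round(100 * (184 + 1287) / 2121, 1)
```

The two BUSCO percentages reproduce the figures quoted in the text from the reported counts: 2050 of 2121 genes found, and 184 + 1287 of 2121 genes present in two or more copies.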
Table 2 Raw data for de novo sequencing of C. quinoa CHEN125 and re-sequencing of 50 quinoa accessions
Table 3 Assembly metrics for Qpac (assembled by Canu) and RefCHEN125 (after haplotype-based scaffolding) and the metrics of other existing quinoa assemblies (ASM168347v1: Jarvis et al. 2017; Cq_real_v1.0: Zou et al. 2017; Cqu_r1.0: Yasui et al. 2016)
Table 4 BUSCO analysis of the Qpac quinoa assembly
We mapped Illumina paired-end data from the same accession to the PacBio assembly and found that 450,353 sites were heterozygous (indels and SNPs included) corresponding to 0.034% of the total genome.
We estimated the genome size of quinoa accession CHEN125 based on the sequencing data by taking into account the k-mer frequency distribution. Based on 17-mers, we estimated a diploid genome size of 1.22 Gbp (Online Resource 11). Similar results were obtained for estimations based on k = 19 to k = 27 (between 1.23 and 1.27 Gbp). Quinoa is an allotetraploid plant, yet its subgenomes are sufficiently different so that its genome essentially behaves like a diploid (Schiavinato et al. 2021). The estimated genome size of 1.22 Gbp and the assembly size of 1.32 Gbp were in good agreement.
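The text does not state which estimator was applied to the k-mer frequency distribution. A widely used approach, shown here purely as an assumption, divides the total number of k-mer occurrences by the depth of the main peak of the k-mer histogram. The function name `estimate_genome_size` and the `min_depth` error cutoff are hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: {depth: number of distinct k-mers observed at that depth}.
    Depths below min_depth are treated as sequencing errors and ignored
    (assumed cutoff; it has to be chosen per dataset).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth at which most distinct k-mers occur: the main coverage peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by the peak depth approximates genome size.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth
```

Running the same estimate for several values of k, as done here for k = 17 to k = 27, is a simple way to check that the result does not depend strongly on the chosen k-mer size.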
Assembly integration with a public genetic linkage map
In order to assign our assembled contigs to the 18 quinoa chromosomes we downloaded sequences that had been previously used as markers for generating a genetic map of the quinoa genome based on single-nucleotide variation (Maughan et al. 2012) and located them in the Qpac assembly. Of 511 available markers specifying 29 linkage groups, 356 aligned uniquely to Qpac contigs. In total, 282 Qpac contigs comprising 270 Mbp (20.5% of the Qpac assembly) could be unambiguously assigned to a linkage group. Of these contigs, 225 contigs carried one marker and 57 contigs carried between two and five markers (Online Resource 3). The fraction of chromosomally assigned contigs could be substantially extended after re-sequencing a set of 50 publicly available quinoa accessions for variant calling and haplotype detection: When haplotypes spanned several contigs, connections to yet unassigned contigs could be established (see below).
Sequencing of 50 quinoa accessions and variant calling
We selected 50 quinoa accessions from Bolivia and Peru originating from the vicinity of Lake Titicaca for low-coverage whole-genome sequencing (Online Resource 4). Seed material of these accessions is available in public repositories so that accessions can be re-grown if needed. Sequencing of genomic DNA was performed on the Illumina platform, resulting in an average of 32 million read pairs per accession, translating into an average of 5.7-fold genomic coverage (range 4.3-fold to 7.4-fold) before filtering. Quality-filtered reads (mean coverage fourfold to sevenfold) from accessions were aligned to the reference genome at an average mapping rate of 47 to 59% (aligning uniquely as pairs within the expected insert size to the genome). The initial variant calling before further filtering using Qpac as reference resulted in a total of 5,005,967 variant positions of which 596,685 were insertions or deletions (indels) and 4,409,282 were single-nucleotide polymorphisms (SNPs) (Online Resource 5). This corresponded to 1.44 million variants per accession on average, of which 62% were homozygous and 38% were heterozygous. For comparison, we applied the same approach to a publicly available quinoa assembly of Chilean cultivar QQ74 (Jarvis et al. 2017), referred to as ASM168347v1, yielding 594,194 indels, 4,706,603 SNPs, and 1.89 million variant positions on average per accession (68% homozygous, 32% heterozygous). Considering that the ASM168347v1 assembly is based on a quinoa cultivar from Chile, while the Qpac reference and the re-sequenced samples both originated from the Bolivian and Peruvian Altiplano region, a higher degree of variation of the re-sequenced lines relative to ASM168347v1 was expected. 
The slightly lower amount of indels in ASM168347v1 could be explained by lower mapping rates to this genome (0.5–1% less) and more reads mapped as single reads (i.e., 1–2.5% more reads that could not be aligned as pairs because of one of the mates not mapping within the expected insert size range). This could indicate structural variation that is hard to detect with short Illumina paired-end sequencing data.
The transition/transversion ratio of variants detected using either reference had the same value of 1.50. The five accessions with the highest number of variants were the same for both genome references and originated from the Acomayo Province (Peru). Acomayo is located in the Cusco region that shares a border with the Ayacucho region where the domestication of quinoa is assumed to have occurred 5000 years ago (Lumbreras et al. 2008). The average density of variation of 3.34 positions per kbp (CHEN125) to 3.53 positions per kbp (QQ74) in the quinoa genome is comparable to variant densities in other domesticated plants (Xu et al. 2019) and is much lower than in wild plants such as A. thaliana (101.95 positions per kbp) (Alonso-Blanco et al. 2016).
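As a reminder of how a transition/transversion ratio is obtained, the sketch below classifies SNPs given as (reference base, alternative base) pairs. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A/G and C/T); all other single-base substitutions are transversions. The function name `ti_tv_ratio` and the input layout are illustrative.

```python
BASES = set("ACGT")
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Compute the transition/transversion ratio from (ref, alt) base pairs.

    Entries that are not single-base substitutions between A, C, G and T
    (for example indels) are skipped.
    """
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed")
    return transitions / transversions
```

For instance, three transitions and two transversions give a ratio of 1.5.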
There were 3,589,483 variant positions in Qpac when considering only biallelic SNPs covered by at least half of the 50 individuals and with minor allele frequency > 5%, which was the set of SNP data used for further analysis in the context of haplotype-based scaffolding. With these settings we observed 1.18 million variant positions on average per accession of which 61% were homozygous and 39% were heterozygous.
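The three filtering criteria named above (biallelic SNP, coverage in at least half of the individuals, minor allele frequency above 5%) can be expressed as a single predicate. This is a sketch under assumptions: genotypes are modeled as diploid calls given as pairs of allele indices, uncovered individuals are `None`, and the function name `passes_filter` is hypothetical.

```python
def passes_filter(alleles, genotypes, min_call_rate=0.5, min_maf=0.05):
    """Decide whether a variant site is kept for haplotype analysis.

    alleles: list of allele strings at the site, reference allele first.
    genotypes: one entry per individual, either a pair of allele indices
    such as (0, 1), or None if the individual is not covered at the site.
    """
    # Biallelic SNP: exactly two alleles, each a single base.
    if len(alleles) != 2 or any(len(a) != 1 for a in alleles):
        return False
    called = [g for g in genotypes if g is not None]
    # Covered by at least half of the individuals.
    if not called or len(called) < min_call_rate * len(genotypes):
        return False
    # Minor allele frequency among the called alleles must exceed the threshold.
    n_alleles = 2 * len(called)
    alt_count = sum(sum(g) for g in called)
    maf = min(alt_count, n_alleles - alt_count) / n_alleles
    return maf > min_maf
```

With 50 individuals, a site needs calls in 25 or more of them, and a single heterozygous call among 50 covered individuals (frequency 1%) does not pass the frequency threshold.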
Ordering of contigs based on haplotype information
The variants along the assembled contigs of Qpac were analyzed in segments of 100 kbp in size. Each contig smaller than 100 kbp (4390 contigs totaling 185 Mbp) was considered a single segment. In total, we obtained 15,204 such segments for our 1.32 Gbp assembly. For each segment we determined a variation fingerprint and performed an all-against-all comparison of fingerprints per segment. Some segments did not display variation and thus had no fingerprint. There were initially 12,366 segments distributed over 3172 contigs showing shared variation patterns with at least one other segment. Highly abundant patterns were removed as well as patterns that were supported only by few positions within a segment (presumably occurring due to rare mutations, recent mutations or wrong variant calls). Fingerprints between segments had to share at least 10% of their variant patterns to be considered as a sufficiently supported haplotype-based relation. After filtering out weakly supported relations of segments there were 2188 contigs that shared variation fingerprints with at least one other contig. Such contigs formed 371 groups consisting of two to 82 contigs (Online Resource 6). In total, the 2188 contigs represented 907 Mbp, i.e., 69% of the assembly. Since variation fingerprints may span several contigs, the local order of contigs was not always resolvable. Among the 371 groups, 306 formed small connective networks that could be unambiguously resolved into a linear structure (Fig. 4). These 306 groups contained 849 contigs (2–11 contigs per group) and encompassed 350 Mbp. Networks of the remaining 65 groups comprising 1339 contigs were more complex (Online Resource 7). To obtain the final assembly RefCHEN125 we only scaffolded or merged contigs for which actual physical evidence was available, i.e., which had their haplotype-based order supported by linking mate-pairs, long reads or overlapping contig ends (see below). 
The contigs that could not be physically linked remained within a contig group (reflected in the name of the assembly sequences). In these cases, haplotype information does not allow the determination of the exact order of contigs, but merely provides information about their adjacency. By our definition, contig groups may consist of bona fide scaffolds, but may also contain contigs without precise positional information.
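The comparison and grouping logic described in this section can be sketched as follows. The text states that fingerprints had to share at least 10% of their variant patterns but does not define the denominator; the sketch assumes the fraction is taken relative to the smaller of the two fingerprints. Grouping is done with a simple union-find over contigs; the function names `shared_fraction` and `group_contigs` are hypothetical.

```python
from collections import defaultdict


def shared_fraction(fp_a, fp_b):
    """Fraction of shared patterns relative to the smaller fingerprint (assumption)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))


def group_contigs(fingerprints, min_shared=0.10):
    """Group contigs whose segments share fingerprints.

    fingerprints: {(contig, segment_index): set of hashable patterns}.
    Returns a list of contig sets, one per group of two or more contigs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    segments = list(fingerprints)
    # All-against-all comparison of segments located on different contigs.
    for i, seg_a in enumerate(segments):
        for seg_b in segments[i + 1:]:
            if seg_a[0] == seg_b[0]:
                continue
            if shared_fraction(fingerprints[seg_a], fingerprints[seg_b]) >= min_shared:
                parent[find(seg_a[0])] = find(seg_b[0])

    groups = defaultdict(set)
    for contig in list(parent):
        groups[find(contig)].add(contig)
    return [members for members in groups.values() if len(members) > 1]
```

The resulting groups only express adjacency. Resolving a group into a linear order requires inspecting the connections within it, and, as noted above, physical evidence is needed before contigs are actually merged or scaffolded.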
Among the groups connected by variation fingerprints, 123 contained contigs previously anchored to genetic markers (Online Resource 6, Online Resource 3). There was general agreement between the genetic markers and the groups formed by variation fingerprints. For example, five among 38 contigs extending the 3.1 Mbp contig “tig00000000,” previously anchored to linkage group 2 (LG 2), carried genetic markers all of which were consistently from LG 2. Of the total of 282 genetically anchored Qpac contigs (270 Mbp), there were 214 contigs that could be extended by 1206 additional contigs (452 Mbp) via shared variation fingerprints (Table 5). Remaining groups not anchored to genetic markers encompassed 250 Mbp in 768 contigs within 248 groups.
Table 5 Fraction of Qpac contigs that could be related to other contigs using haplotype information and/or could be anchored to linkage groups using genetic markers
In summary, of the 1.32 Gbp quinoa Qpac assembly, 908 Mbp showed improved ordering after exploiting shared variation fingerprints (i.e., haplotype information). The remaining 4559 contigs (411 Mbp) with a mean size of 90 kbp lacked variation so that fingerprint comparisons could not be performed. However, of these, 68 contigs in 65 Mbp had already been chromosomally anchored based on genetic markers. The total genetically anchored fraction of the assembly after considering haplotype information was 722 Mbp comprising 55% of the Qpac assembly.
Verification of haplotype-based relations
The order of contigs based on haplotype information was verified based on consistency with Illumina mate-pairs, long reads, and/or overlapping contig ends, in comparison with another quinoa assembly, and by the assignment to linkage groups via genetic markers.
Among the 123 contig groups that also contained at least one genetic marker, only three groups contained markers from two different chromosomes. This was the case for Qpac contig “tig00007425” (965 kbp) previously assigned to LG 1 but related to several contigs assigned to LG 6 by haplotype information. However, “tig00007425” also contained a uniquely aligning genetic marker from LG 6 with a lower mapping score (Online Resource 3). The LG 6 marker Cq04033_520 mapped close to the left end of the contig and the LG 1 marker Cq07822_1014 mapped near the right end of the contig. All contigs related via variation fingerprints and assigned to LG 6 emerged from the left end of “tig00007425.” Upon inspection of the read coverage we noticed a coverage drop between the two markers, indicating the possible location of a misassembly in “tig00007425.” Similar cases were found in the two other groups containing contradictory linkage group assignments; these groups also showed coverage drops between contradictory markers (“tig00005716” assigned to LG 12 grouped with four LG 15 assigned contigs and “tig00008618” assigned to LG 3 grouped with two LG 10 assigned contigs). Thus, apart from confirming available LG assignments, the haplotype information led to the identification of three cases of potential misassemblies which were resolved by splitting the respective contigs. On the other hand, 36 haplotype-based contig groups contained at least two genetic markers from the same LG and confirmed each other.
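The coverage inspection between two contradictory markers was described only qualitatively. One way such a check could be automated is sketched below; the window size, the threshold of 25% of the contig median, and the function name `find_coverage_drops` are assumptions made for illustration and do not reflect the procedure used in this study.

```python
from statistics import median


def find_coverage_drops(depth, start, end, window=1000, max_fraction=0.25):
    """Report low-coverage windows between two positions on a contig.

    depth: list of per-base read depth along the contig.
    start, end: 0-based interval to scan, e.g. the positions of two markers.
    A window is reported as (left, right, mean_depth) if its mean depth is
    below max_fraction times the median depth of the whole contig.
    """
    baseline = median(depth)
    drops = []
    for left in range(start, end, window):
        right = min(left + window, end)
        mean_depth = sum(depth[left:right]) / (right - left)
        if mean_depth < max_fraction * baseline:
            drops.append((left, right, mean_depth))
    return drops
```

A reported window between markers from different linkage groups marks a candidate breakpoint at which the contig could be split, subject to manual inspection.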
The group comprising the largest number of contigs included two LG 1 contigs and a total of 82 contigs (13 Mbp in total, Online Resource 6). The largest group in terms of basepairs included 13 contigs on LG 5 and encompassed a total of 74 contigs (40 Mbp). This group was also the one with the highest number of contigs containing genetic markers, i.e., 13 markers in total. Overall, the organization of contigs based on haplotype information confirmed the LG assignments based on genetic markers.
The contig groups were overall in agreement with the chromosome-level quinoa assembly ASM168347v1 from the Chilean cultivar QQ74 (Jarvis et al. 2017). For example, when aligning all contigs and contig groups assigned to LG 2 (44 Mbp in total) to the ASM168347v1 assembly, most of them had an end-to-end sequence alignment in chromosome 2 (Fig. 5). The ASM168347v1 chromosome 2 is 59 Mbp in length, including 2 Mbp of unspecified bases. While networks created by haplotype information could not fully determine the order and orientation of each contig, the overall order of larger contigs and their chromosome assignment were consistently confirmed by ASM168347v1. The scaffolding of CHEN125 permitted the highlighting of three large-scale inversions in chromosome 2 when compared to QQ74, which were strongly supported by paired-end data (Fig. 5, blue lines).
To further validate the haplotype-based contig ordering and orientation we sequenced Illumina mate-pairs of five different span sizes ranging from 2.5 kbp to 8.1 kbp (Online Resource 8). We also took overlaps between contigs and spanning long reads into account. Among all the haplotype-based relations (6435), 3351 had at least one supporting long read. Long reads had a size similar to the largest mate-pair library insert size (about 10 kbp) so that connections between contigs overlapping by more than 10 kbp or between contigs that had misassembled ends longer than 10 kbp could not be spanned by the reads. In total, we connected 1848 contigs into 790 sequences containing two to 13 contigs. In total, 1298 connections between contigs were confirmed by mate-pairs, 416 connections were confirmed by overlaps, and 712 connections were confirmed by both mate-pairs and overlaps. Additionally, long reads confirmed 1480 connections already supported by mate-pairs or overlaps. Overall, among the 2188 contigs related by haplotype information 84.5% had confirming physical evidence in the form of bridging mate-pairs, matching long reads, and/or overlaps to establish their connections.
Scaffolding and merging of quinoa contigs
We integrated information from overlaps, mate-pairs, and long reads to improve the contiguity of the Qpac assembly and obtained a scaffolded version called RefCHEN125. The RefCHEN125 assembly had an N50 length of 1079 kbp, a total size of 1.314 Gbp, and encompassed 5689 sequences (Table 3). The difference in size between RefCHEN125 and Qpac was mainly due to contig overlaps supported by mate-pairs or long reads. Because the span sizes of our mate-pair libraries (max. peak size 8 kbp) were in the same range as the PacBio subread length, the detected links resulted more often in merging than in insertion of gaps. The main improvement arose from 2188 contigs being reordered according to their adjacency in the genome inferred from haplotype information. The Qpac assembly initially provided contiguous information for a maximal stretch of 7.3 Mbp, whereas RefCHEN125 provided haplotype-based evidence for ordered contigs in large regions of up to 39.7 Mbp and scaffolded contigs in regions of up to 10.5 Mbp. The sequences in the FASTA file of RefCHEN125 were ordered and renamed to reflect their position within the assembled genome.