Introduction

Beet Curly Top (BCT) is a viral disease prevalent throughout the Western US and Middle East. The beet curly top virus (BCTV) is a parthenogenic Curtovirus transmitted by the beet leafhopper (Circulifer tenellus). BCTV infection can cause devastating losses for beet growers and threatens the profitability of the sugar industry in areas where the disease is common. Currently, BCT disease pressure is managed by a combination of genetic resistance present at varying levels within commercial varieties and management practices in order to control vector populations such as insecticide sprays and seed treatments (e.g., Poncho-Beta, Cruiser and NipsIT) (Strausbaugh et al. 2012, 2014). Genetic resistance to BCTV is a quantitative trait (Panella et al. 2014), and the breeding and maintenance of varieties with durable genetic resistance is complex due to the need to continuously incorporate resistance genes for multiple pathogens simultaneously. Breeding varieties with durable genetic resistance is becoming more important due to the potential loss of effective chemicals to control the disease in fields which will likely increase in severity and as a result could compromise the productivity and profitability of the crop for growers (Strausbaugh et al. 2006). A fundamental understanding of the molecular mechanisms (e.g., genome variation, genes, and pathways) could allow for the development of a genome informed approach to improve resistance in varieties and address this long-standing challenge to sugar beet cultivation in regions where BCTV is problematic.

The current state of genomic resource development for beet is ongoing. A plethora of research in diverse systems has demonstrated how this technology can interface directly with crop improvement (Bevan et al. 2017). The construction of reference genomes is also evolving; long-read technologies provide solutions to the challenges encountered when assembling highly repetitive plant genomes using short reads. Several reference genomes exist for beet which highlights this progress (Dohm et al. 2014; McGrath et al. 2020). High-quality reference genomes in beet have provided a lens to measure genome diversity (Galewski and McGrath 2020) as well as assess the content of important functional genes such as those involved in disease resistance (Funk et al 2018). Comparing genome assemblies requires relatively complete genome sequences in order to evaluate the content and organization of genomic variation within genomes. Only recently has sequence data become available which looks at genome diversity in primary (Galewski and McGrath 2020), secondary (Galewski and McGrath 2020), and tertiary (del Rio et al. 2019) gene pools of sugar beet.

PacBio long reads have resulted in more contiguous assemblies and provides a means to investigate larger structural variation, inversions, and their resulting phenotypic consequences. Structural variation (SV), including indels, likely contributes more to lineage divergence than single-nucleotide polymorphisms (SNPs) due to a larger potential for these variants to alter gene function (Chen et al. 2009). Long reads have a greater ability to detect SV and chromosomal inversions than short reads. Recent studies have shown associations between large chromosome-scale structural variants with adaptive and agronomic trait variation in plants (Schiessl et al. 2019). In Brassica napus, up to 10% of all genes were affected SV events. The complex nature and highly duplicated nature of plant genomes makes long-read technology well suited for addressing SV in a high-throughput manner (Chawla et al. 2020).

One hope of developing genomic resources in beet is that useful genetic markers can be extracted to facilitate in the breeding and improvement process. Genetic markers can facilitate the introgression of important characters such as BCTV resistance from resource populations (e.g., mapping populations, germplasm resources, and wild relatives) into elite cultivars. Generating sequence data that captures causal variation should be considered in any project which can lead to a better understanding of the underlying molecular mechanisms, provide targets for gene manipulation, and validation of gene function. This paper represents a pipeline for the fast and efficient characterization of beet genomes using Pac-bio long-read sequencing to ultimately inform the genetic basis of important agronomic characters. The double-haploid KDH13 (PI 663862) was created from a single plant selected for BCTV resistance from the population C762-17 (PI 560130) (Eujayl et al. 2016). This line was sequenced and assembled to capture the genetic variation associated with BCTV resistance. Additionally, Illumina sequencing data was generated for KDH13, KDH19-17, and an F1 hybrid (KDH13 × KDH19-17) in order to determine informative genomic markers that can be used in the analysis of F2 families and breeding for improved BCTV resistance.

Materials and Methods

KDH13 Plant Material

Seeds from the sugar beet (Beta vulgaris L.) doubled-haploid line KDH13 (PI 663,862) were planted in Sungro Horticulture Professional Growing Mix, containing Canadian sphagnum peat moss, perlite, dolomite lime, and a long-lasting wetting agent, RESiLIENCE. Beets were grown in a growth control chamber at 22 °C, then moved to a greenhouse after three and a half months.

KDH13 Genome Sequencing

A single KDH13 sugar beet plant was dark treated for 48 h in order to enhance the efficiency of DNA extraction by depleting sugar reserves and inhibiting secondary metabolite synthesis. Whole leaves were placed into 50-ml centrifuge tubes and filled with liquid nitrogen and stored at − 80°. Samples were shipped overnight on dry ice to GENEWIZ Plainfield (NJ, USA) for nuclei isolation, extraction of high molecular weight (HMW) DNA, and library preparation. A 20-kb SMRTBell Library was prepared and sequenced using a total of 3 SMRTcells on the PacBio Sequel platform. Read lengths generated from the PacBio sequencing run were plotted using the program QUAST (Mikheenko et al. 2018).

KDH13 Genome Assembly

The resulting reads were used to assemble a de novo genome for the BCTV resistant double-haploid, KDH13, using the program Canu1.7 (Jayakumar and Sakakibara 2019). Canu1.7 was run with default parameters using an expected genome size of 758 Mb (Arumuganathan and Earle 1991).

KDH13 Genome Evaluation

Several programs were used to evaluate the KDH13 assembly and compare with other current beet genome assemblies (e.g., RefBeet1.2 and EL10.1), representing genome sequences resulting from different genetic backgrounds and sequencing and technologies. Genome statistics were generated using the script assemblethon.pl (https://github.com/ucdavis-bioinformatics/ assemblathon2-analysis) (Bradnam et al. 2013). The program BUSCO was used to evaluate the completeness of these genomes by the identification of kingdom-specific core orthologs found across other “gold standard” plant reference genomes (Seppey et al. 2019).

KDH13 Functional Genome Predictions

Ab initio gene prediction was carried out on the KDH13 genome assembly using AUGUSTUS version 3.3.2. Current HMM models representing the Pfam-A dataset were downloaded from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/-releases /Pfam33.1/Pfam-A.hmm.gz, and the predicted proteins were annotated using the program PfamScan which used the HMM protein family signatures and domains within the KDH13 genes predicted with AUGUSTUS. The program liftoff (Shumate and Salzberg 2020) provided additional evidence to gene model structures by transferring the annotations from the reference EL10.1 to the KDH13 genome assembly.

Genome Order

The program Blastx was used to identify physical positions of the protein predictions within the EL10.1 genome. Positional information in the resulting blast file allowed the proteins to be used as anchors to determine a more contiguous order of KDH13 contigs to the EL10.1 genome, which was scaffolded using proximity information such as Hi-C and BioNano optical maps. Custom scripts were used to evaluate the output of Blastx using the KDH13 predicted proteins with the ordered EL10.1 protein prediction and determine a more contiguous order of the KDH13 assembly, which related to accepted linkage groups (Schondelmaier and Jung 1997) and the chromosome nomenclature detailed in Butterfass 1964. Additionally, the ordered genome produced was validated by comparing the order of the EL10.1 genome annotation transferred using the program liftoff (Shumate and Salzberg 2020).

Analysis of Genome Variation

This variation was used to assess unique variation within KDH13 genome. Illumina genome sequencing was carried out for three different genotypes, KDH13, KDH19-17, and F1 (KDH13 × KDH19-17). All reads were filtered and trimmed in a sliding window based on the average quality parameter (Q > 20) using Trimmomatic (Bolger et al. 2014). Filtered reads were used for all downstream analysis. BWA (Li 2013) was used to align Illumina short-read data for DH line KDH13 to the EL10.1 genome. Variant calling was performed on the bam files using the program Freebayes (Garrison and Marth 2012). Variants were filtered for quality and read depth (e.g., read depth, N > 10 and N < 300, QUAL ≥ 20, and mapping quality ≥ 60). The program Manta (Chen et al. 2016) was used to detect structural variation within KDH13, KDH19-17, and F1 (KDH13 × KDH19-17) using Illumina data. The density of variation between the two parents KDH13 and KDH19-17 was estimated to give an idea of the divergence between the two parental lines. Genomescope 2.0 (Ranallo-Benavidez et al. 2020) was used to evaluate k-mer spectra produced by the program Jellyfish (Marçais and Kingsford 2011) for each Illumina dataset. The K-mer data provided estimates of the size of the genomes, content of repetitive sequences, and the quantity and distribution of heterozygosity.

Relationship Determination

Biallelic SNPs were used to calculate pairwise relationship coefficients between lineages (both inbreed and outbreed) using the Kinship Inference for Association Genetic Studies (KING) package (Manichaikul et al. 2010). Data for additional beet populations also used in clustering was derived from Galewski and McGrath (2020).

Visualization of KDH13 Genome (Python and R)

Genome order was deduced from alignment of KDH13 predicted proteins with known locations on assembly contigs with scaffolds of the EL10.1 genome assembly, effectively using the protein sequence as anchors and custom scripts to determine the position and order of predicted proteins within the EL10.1 genome. Custom R scripts were written to visualize the KDH13 genome relative to other beet genomes (e.g. Beta vulgaris crop types (Galewski and McGrath 2020), F1 (KDH13 × KDH19-17), and KDH19-17 lineages. Gene density was extracted from the EL10.1 gene set. Gyspy and Copia repeat density was extracted from LRT-retriever (Ou and Jiang 2018). The centromeric repeat Beetle7 was used in cytogenetic studies of beet (Kowar et al. 2016) and was located within the EL10.1 genome using Blastn. The blast results were parsed and plotted against the EL10.1 genome.

Phenotypic Data

Phenotypic evaluation of curly top resistance was carried out using the scale detailed in Mumford 1974. Individual beets from the resistant parent (KDH13) susceptible parent (KDH19-17) and F1 populations were grown in the greenhouse, infected with BCTV via beet leaf hoppers and evaluated for disease severity.

Results

PacBio sequencing was used to generate long reads (~20 kb) for the sugar beet double-haploid KDH13. Using these reads, a genome was assembled using the program Canu 1.7. This new reference genome assembly served to capture sequence variation for a genome associated with resistance to the BCTV. To evaluate the quality of the KDH13 genome, comparisons with other published beet genomes were carried out against the EL10.1 and RefBeet-1.2.2 genomes. The KDH13 contigs were ordered relative to the EL10.1 assembly using predicted proteins as anchors to find their genomic locations within EL10.1. Finally, Illumina data was used to give insight into the content and distribution of genome variation contained within genomes of KDH13, KDH19-17, and an F1 hybrid (KDH13 × KDH19-17). These data produced informative markers for characterizing these lines and for evaluating the segregation of durable genetic resistance to BCTV in F2-derived populations.

KDH13 Sequencing, Genome Assembly, and Evaluation

In total, 14.8 Gb was produced from the PacBio Sequel II sequencing platform resulting in an average genome wide coverage of 19.54 × using an estimated genome size of 758 Mb. The average read length was 20,901 bp (Fig. 1). The de novo genome assembled from PacBio reads resulted in a genome size of 573,479,205 bp in length. The Assemblethon.pl script allowed direct comparison of genome assemblies between different genotypes and sequencing strategies (Table 1). Comparisons between KDH13 and previously sequenced reference genomes show similar genome sizes were recovered by all methods (520–573 Mb) (Table 2). Consistently, these values fall short of the 758 Mb predicted size. Long-read single-molecule sequencing technology shows a large reduction in the number of scaffolds in PacBio KDH13 genome assembly (N = 6246) versus Illumina RefBeet-1.2.2 (40,246). The EL10.1 genome which in addition to PacBio reads used Bio Nano optical maps and Hi-C proximity information, produced a scaffold number equal to the haploid chromosome number (N = 9). Similarities in nucleotide percentages were observed between genomes with a reduction in RefBeet-1.2.2 as a result of the addition of Ns to the genome in order to link scaffolds into pseudomolecules using a linkage map.

Fig. 1
figure 1

Read length distribution of PacBio sequence data

Table 1 Description of the previously sequenced Beta vulgaris genomes and materials
Table 2 Beta vulgaris genome assembly comparisons

Gene Content Comparisons

BUSCO compared the presence of 425 conserved ortholog genes across the different assemblies. The presence of core ortholog genes tests the completeness of each genome assembly. This may result from differences in assembly methods and sequencing strategies rather than genotypes used. RefBeet-1.2.2 showed the largest number of complete BUSCOs, followed by EL10.1. The KDH13 genome was less complete than EL10.1 or RefBeet-1.2.2 and had the greatest proportion of missing BUSCOs (7.3%) and fragmented BUSCOs (8.0%). The KDH13 genome did not use Illumina short-read data in the assembly (Table 3).

Table 3 Results of BUSCO analysis

Functional Genome Predictions

Ab initio gene prediction using AUGUSTUS produced 80,342 protein coding genes. Based on what has been recovered in EL10.1 (24,255) and RefBeet-1.2.2 (27,421), this figure is high, suggesting additional CDNA and EST evidence for accurate genome annotation. Of these predicted genes, 40,779 contained one or more Pfam domains and a blast of these proteins to the EL10.1 predicted protein set showed 27,847 of the predicted proteins aligned to the EL10.1 protein set with confidence levels assessed by both e-values < 1 × 10−60 and a sequence similarity > 75%. Additionally, gene annotations were transferred from the EL10.1 genome to KDH13 using the program liftoff. In total, 23,389 (96.4%) of the 24,255 genes from the EL10.1 were placed on the KDH13 genome assembly. Together, these datasets offered a picture of the gene content, protein coding capacity, and functional regions within the KDH13 genome.

Order of Genome Contigs

The protein predictions from the KDH13 genome were used as anchors to place KDH13 contigs in the order and orientation of the EL10.1 genome. For contigs with one or more protein predictions, blastx results were evaluated for percent identity and e-value to determine a genomic location within EL10.1. In total, 4681 of the 6245 contigs (75%) were placed in the order and orientation of the EL10.1 genome which was determined using Bio Nano optical maps and Hi-C proximity information. The number of contigs placed on each chromosome is present in Table 4. The 25% of contigs which showed less confidence for placement in a more contiguous order were left as un-scaffolded. The contigs that were placed in the correct sequence were equal to 502,929,268 bp or 87.7% of the total size of the KDH13 genome assembly. The success of this approach was confirmed by the congruence with the order of EL10.1 gene order produced by liftoff. The unanchored contigs were difficult to place in the genome using either method.

Table 4 Number of contigs placed in the order and position relative to the EL10.1 reference genome

Genome Variation

KDH13 Illumina reads were mapped to the EL10.1 genome. As quality thresholds were adjusted, coverage was reduced, suggesting up to 90.7% was highly similar and 5.4% of the genome had no coverage even using low mapping quality thresholds. This small percentage may account for the divergence between KDH13 and EL10.1 as alignments were not predicted due to the accumulation of sequence polymorphism but may result from sequencing error (Table 5). Mean depth across the genome was 22.34 × of the EL10.1 genome. The standard deviation was 122.91 with a depth of coverage ranging from 1 to 9061 across sites. Genomescope 2.0 showed the k-mer spectra of the F1 (KDH13 × KDH19-17) was intermediate to the respective parents (Fig. 2). The genome size, degree of homozygosity, and number of repetitive sequences are detailed in Table 6. The heterozygosity of the F1 was 1.43%. This represents a sevenfold increase in heterozygous loci relative to the double-haploid parents KDH13 and KDH19-17 (0.19% and 0.20%).

Table 5 Alignment of KDH13 Illumina sequence data to EL10.1 genome using different quality thresholds
Fig. 2
figure 2

K-mer spectra of KDH19-17, F1 (KDH13 × KDH19-17), and KDH13

Table 6 Results of K-mer spectra analysis using Jellyfish and Genomescope 2.0

KDH13, KDH19-17, and F1(KDH13xKDH19-17) Genome Variation

A variant calling and filtering pipeline was used to determine the variation specific to KDH13 and KDH19-17 populations. Unique variation contained within the KDH13 genome likely conditions or is linked to BCTV resistance. In total, 3,086,720 variants were detected. Of these, 2,259,324 (73.2%) were classified as SNP, 191,448 (6.2%) as insertions, 198,057 (6.4%) as deletions, 268,090 (8.7%) as complex substitutions, 90,004 (2.9%) as multi nucleotide polymorphisms (MNP), and 79,797 (2.6%) as SV (Table 7). Of the total variants detected in the parental genomes KDH13 and KDH19-17, 76,388 (6.1%) and 234,262 (13.2%) were detected as heterozygous respectively. Several regions appeared to contain greater heterozygosity than expected across the genome. Most notably were regions on chromosome 3 and chromosome 7 for KDH19-17 (Fig. 3). The variants were plotted to determine if this result was biological (e.g., repetitive regions) or technical (e.g., assembly or alignment) in nature. The similarity between the KDH13 and EL10.1 genome was apparent. Less variants were detected in KDH13 relative to KDH19-17. The number of variants that can discriminate between KDH13 and KDH19-17 was estimated to be 1,272,863 (40.2%). The additional variant sites were similar between two parents and lacked the ability to discriminate between parents either as a result of fixation for the same variant allele and/or segregation within the parental populations. Relationships were determined between KDH13, KDH19-17, F1 and various accessions representing cultivated Beta vulgaris (Fig. 4). A major difference in hierarchical clustering suggests breeding history plays a major part in divergence. Two groups could be identified: an inbreed group consisting of C869, W357B, KDH13, KDH19-17, F1 and an outbreed group of pooled populations. Divergence via inbreeding appeared to have a large effect on clustering, which appears to be more important than shared ancestral lineage. Most obvious is the divergence between the C869 single plant versus C86925, which represents a pool of 25 individuals. KHD13 resolves closer to C869 than KDH19-17; this was also seen in an increase in the number of indels and SNP shared between these two lines, suggesting a greater degree of shared variation. The population C869 represents the population from which EL10.1 was derived. When clustering the data based on relationship coefficients, the F1 resolved intermediate to the two parents KDH13 and KDH19-17, providing additional support for the hybrid nature of the F1 genome.

Table 7 Variation detected within KDH13, KDH19-17, and F1 (KDH13 × KDH19-17) using reference genome EL10.1
Fig. 3
figure 3

Plot of genome variation in KDH13, KDH19-17, and F1 (KDH13xKDH19-17) relative to the Beta vulgaris EL10.1 genome

Fig. 4
figure 4

Hierarchical clustering based on relationship coefficients determined between KDH13, KDH19-17, and F1 (KDH13 × KDH17-19) relative to other sequenced genomes representing the cultivated lineages of Beta vulgaris

Phenotypic Data (KDH13, KDH19-17, and F1 (KDH13 × KDH19-17)

KDH13 showed a greater degree of resistance to BCTV relative to KDH19-17 (rated 2.0 on a scale of 0 to 9). No variability within these populations suggests a highly reproducible resistance phenotype has been captured within the KDH13 genetic background. KDH19-17 proved to be susceptible (rated 7.0), making it a good parent in order to develop F2-derived populations where recombinants of the parental genomes can be evaluated. The F1 population all showed an intermediate phenotype (rated 3.5) (Fig. 4). Although comparisons of phenotypic data between contrasting parents and the F1 cannot determine whether alleles have dominant, codominant, and recessive gene action of alleles, the intermediate phenotype of the F1 indicates the genetic variation as conditioning resistance to BCTV is heritable. Both the phenotypic data and marker data support the hybrid nature of the population and further analysis of F2-derived populations in order to determine what genomic variation influences BCTV resistance (Fig. 5).

Fig. 5
figure 5

Beet curly top disease scores for F1 (KDH13 × KDH19-17) and parental lines KDH13 and KDH19-17

Discussion

Long-read PacBio sequence data was generated for the assembly of a genome for BCTV-resistant genetic stock KDH13 (PI 663862). The ability to capture of genome sequence data associated with agronomically important lineages is the backbone of a “genome-informed” approach to crop improvement. The knowledge and resources developed from genome sequencing can make plant breeding, genetic resource conservation, and functional genomics programs more effective and efficient. The identification and sorting of functional variation within genomes associated with crop productivity and sustainability allows the efficient transfer traits into elite cultivars via genetic markers. If traits are understood from a molecular perspective or by association, traits such as durable genetic resistance can be deployed more rapidly to address grower challenges.

Sequence data of KDH13, KDH19-17, and an F1 individual (KDH19-17 × KDH13) (F1) allowed for determination of KDH13-specific variation. In total, 1,272,863 variants were informative for distinguishing parental genome segments in recombinant F2-derived populations. Methods which look at phenotypic bulks within segregating F2-derived populations will likely be able to reduce the number of variants to a more tractable number for breeding. In fact, relatively simple population designs coupled with next-generation sequencing have been successful in identifying causal variation in beet (Ries et al. 2016) using a mapping by sequencing approach (Hartwig et al. 2012). The ability to produce DH lines and inbreed populations can provide a stable source for important characters which can be readily used by seed companies as a source to enhance characters such as disease resistance.

The broad similarity observed between sugar beet genomes RefBeet-1.2.2, EL10.1, and KDH13 was evident. This included a high read alignment percentage, clustering based on relationship coefficients, relative proportions nucleotides, and genome assembly size. More in-depth comparisons regarding comparisons between EL10.1 and RefBeet-1.2.2 are available in McGrath et al. (2020). Sugar beet genomes share a high degree of genome variation and cluster at the population level (Galewski and McGrath 2020). Perhaps not surprising given the high degree of conservation of gene order that was observed within phylogenetically distant beta taxa (del Rio et al. 2019), as well as the known domestication history and putative genetic bottlenecks which served to fix important traits but may limit the potential for future improvement (Hoffmann and Kenter 2018). Using different sequencing technologies may be a way to characterize and manage diversity better (e.g., SNP and indels using Illumina, SV and chromosome inversions using PacBio, and chromosome level variation, using Hi-C, Bio Nano, and linkage maps). Diversity can be deleterious as well as beneficial; it is needed for improvement because selection serves to purge diversity as beneficial traits become fixed. The genetic base of modern corn lineages exhibits a 43% reduction in genetic diversity relative to progenitor populations (Wright et al. 2005). Boarding the genetic base of important agricultural crops including has been proposed in beet in order to gain important agronomic characters such as disease resistance (Panella and Lewellen 2007). This has been identified as an issue in many crops to the degree that “re-domestication” efforts are now being considered (Fernie and Yan 2019).

Long reads have been critical for characterizing structural variants in crop species. PacBio has been used in major crops to address this question by examining chromosome level variation within the species (Sun et al. 2018; Zhang et al. 2016). High-quality reference genomes of beet appear to require additional technologies to achieve chromosome level scaffolding such as Hi-C and Bio nano optical maps. Genetic linkage maps were used to order scaffolds of the RefBeet-1.2 genome assembly. Evaluation of beet genome assemblies has benefited from knowledge of physical cytogenetic marks discovered using florescence in situ hybridization (Paesold et al. 2012). Genome quality was highly variable as a result of different sequencing strategies and technologies. RNAseq was generated for the EL10.1 and RefBeet-1.2.2 which provided additional evidence for gene models. The RefBeet-1.2.2 assembly generated full-length CDS from single-molecule sequencing which may have increased the accuracy of gene annotation and estimates of gene content. RNAseq will continue to be an effective tool to annotate genes present in diverse genomes but not the reference. Pan genome analysis is becoming necessary to determine how genome composition affects phenotypic differences within and between lineages (Tao et al. 2019). In beets, as in many important crops, disease resistance is the hallmark of breeding efforts and critical for sustainability and future production of the species within conventional agricultural systems. Many of these genes are likely highly duplicated and may be located in repetitive regions of the genome (Funk et al. 2018). This makes long-read sequencing more attractive for defining these targets. As new material is sequenced for insight into traits of agronomic importance, we can assume greater levels of chromosome level variation will be present with greater genetic distance. The KDH13 genome provides a resource and insight into variation of high agronomic value which may not be present in current reference genomes. Future work will determine if PacBio sequencing alone can capture the unique genomic variation associated with BCTV resistance and whether the positional information and gene models leveraged from existing reference genomes is sufficient to compare the order and orientation of genome variation as well as functional gene content between diverse beet genomes.