Alignment of 12 potato landrace and wild genomes against two reference genomes shows greater overall match with DM1-3 than with M6
To detect structural variation in the genomes of potato landraces from the genebank at the International Potato Center (CIP, Lima, Peru), genomic DNA was sequenced from a panel of 12 accessions. These accessions were chosen to include representative individuals from each of the seven species, nine taxa and one wild relative proposed by Hawkes (1990). Six are diploids: Solanum stenotomum subsp. goniocalyx (GON1), S. stenotomum subsp. goniocalyx (GON2), S. phureja (PHU), S. xajanhuiri (AJH), S. stenotomum subsp. stenotomum (STN) and S. bukasovii (BUK); two are triploids: S. juzepczukii (JUZ) and S. chaucha (CHA); three are tetraploids: S. tuberosum subsp. andigenum (ADG1), S. tuberosum subsp. andigenum (ADG2) and S. tuberosum subsp. tuberosum (TBR); and one is a pentaploid: S. curtilobum (CUR). The genomic DNA reads from the twelve genomes were aligned against the DM1-3 potato reference genome v.4.04 (Hardigan et al. 2016) and against the pseudomolecules of the S. chacoense M6 potato reference genome (Leisner et al. 2018). DNA reads from S. chacoense (M6) and S. commersonii (retrieved from NCBI SRA: SRP097632 and SRP050408, respectively) were also aligned against DM1-3, and DNA reads from S. commersonii (Aversano et al. 2015) were aligned against M6. The S. commersonii genome was not used as a reference because its scaffolds were not long enough. Unaligned, unpaired reads and aligned positions with low-quality scores were removed.
As shown in Fig. 1, more reads from each genome aligned to DM1-3 than to M6, likely because only the pseudomolecules were used as the reference in the M6 analysis. On average, the aligned reads covered 643 Mb of the DM1-3 genome and 436 Mb of the M6 genome. The average read depths for each genome ranged from 35.6X (in BUK) to 50.3X (in GON2). The percentage of the reference genome covered by each of the newly sequenced genomes is shown in Fig. 1. The panel of 12 sequenced genomes covered a minimum of 604 Mb and 416 Mb of the DM1-3 and the M6 reference genomes, respectively. Within the 604 Mb of the DM1-3 genome covered, there are 37,395 genes (97% of the total number of genes). The portion of each reference genome covered in common by the genomes of the panel was 328 Mb for the diploids and 285 Mb for the polyploids when aligned to DM1-3, and 119 Mb for the diploids and 107 Mb for the polyploids when aligned to M6.
The genome alignments against DM1-3 and M6 were used to identify sequence-level variations such as single nucleotide polymorphisms (SNPs) and structural variations such as copy number variation (CNV). High levels of CNVs were observed in the 12 sequenced genomes. Some of the CNV regions are identical and, thus, conserved among these genomes. The comparison of the diploids to DM1-3 showed that in the majority of the diploids (with AJH and BUK and the publicly available COM and M6 genomes being the exceptions), the number of genes impacted by deletions is greater than the number of genes impacted by duplications (Supplementary Figure 1A). Interestingly, in AJH, BUK, COM and M6 deletions outnumber duplications, but the duplications are larger and thus impact a higher number of genes. Additionally, the polyploids also have fewer but larger duplications, resulting in more genes impacted by duplications than by deletions (Supplementary Figure 1A). Furthermore, the comparison of the diploids and the polyploids with M6 showed that the numbers of deletions and duplications are similar, but the duplications are again larger, resulting in more genes impacted by duplications (Supplementary Figure 1B). Not unexpectedly, the number of genes impacted by duplications is greater in the polyploids than in the diploids. In general, both reference genome comparisons show that the majority of the deletions occur in the intergenic regions, and thus duplications affect more genes than deletions (CNVs were more common in the intergenic regions). Finally, there are many more SNPs in the 12 genomes compared to DM1-3 than compared to M6, probably because a smaller portion of the M6 genome was available for alignment. Overall, 275 CNV-impacted genes were in common across the panel of 12 sequenced genomes. Of those, 109 and 166 genes are impacted by duplication and deletion, respectively.
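The observation that fewer but larger duplications can impact more genes than a greater number of small deletions follows from simple interval overlap. The sketch below is illustrative only: the function name, the tuple layout and the toy coordinates are hypothetical and are not taken from the pipeline used in this study, and coordinates are assumed to be 1-based and inclusive.

```python
def genes_hit_by_cnvs(genes, cnvs):
    """Count genes overlapped by at least one deletion or duplication.

    genes: list of (gene_id, chrom, start, end)
    cnvs:  list of (chrom, start, end, kind), kind is "deletion" or "duplication"
    Coordinates are assumed to be 1-based and inclusive.
    """
    hit = {"deletion": set(), "duplication": set()}
    for gene_id, g_chrom, g_start, g_end in genes:
        for c_chrom, c_start, c_end, kind in cnvs:
            # Two closed intervals overlap when each starts before the other ends.
            if g_chrom == c_chrom and g_start <= c_end and c_start <= g_end:
                hit[kind].add(gene_id)
    return {kind: len(ids) for kind, ids in hit.items()}


# Toy data (hypothetical): three small deletions and one large duplication.
genes = [
    ("g1", "chr01", 100, 200),
    ("g2", "chr01", 300, 400),
    ("g3", "chr01", 500, 600),
    ("g4", "chr01", 700, 800),
]
cnvs = [
    ("chr01", 150, 160, "deletion"),     # inside g1
    ("chr01", 210, 250, "deletion"),     # intergenic
    ("chr01", 410, 450, "deletion"),     # intergenic
    ("chr01", 290, 810, "duplication"),  # spans g2, g3 and g4
]
counts = genes_hit_by_cnvs(genes, cnvs)
print(counts)
```

In this toy case the deletions are three times as numerous as the duplications, yet only one gene is affected by a deletion while three are affected by the single duplication.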
The average size of the genomic regions impacted by CNVs in the diploids is approximately 311 Mb and 314 Mb compared to DM1-3 and M6, respectively. AJH and BUK have the largest CNV-impacted genome region when compared to DM1-3; however, when compared to M6, it is AJH and PHU that have the two largest CNV-impacted regions. For the polyploid genomes, an average of 378 Mb and 333 Mb of CNV-impacted regions is observed when compared to DM1-3 and M6, respectively. JUZ has the largest CNV-impacted region when compared to DM1-3, followed by CUR. When compared to M6, CUR has the largest CNV-impacted region, followed by JUZ.
The heterozygosity of each genome was estimated, as a percentage, using the trimmed Illumina reads. As shown in Table 1, the heterozygosity of the diploids ranges between 1.73% (in GON2) and 4.48% (in AJH). The heterozygosity of the polyploids ranges between 3.52% (in ADG1) and 12.02% (in CUR) (Table 1). This suggests that heterozygosity tends to increase with ploidy and that it is greater outside the Stenotomum and Phureja potato groups.
Distribution of single nucleotide polymorphisms detected in the genomes compared to the DM1-3 and M6 reference genomes
The number of SNPs detected compared to the DM1-3 genome ranges from 3.8 million in diploid PHU to 12.9 million in the pentaploid CUR genome (Table 1). The largest number of SNPs detected in the diploids is found in BUK, a wild potato genome, with ~7 million SNPs. In the triploids, 6.6 million SNPs are detected in CHA and 10.5 million in JUZ, while the number of SNPs detected in the tetraploids ranges from 7.1 million in TBR to 7.9 million in ADG1 (7.7 million in ADG2). Moreover, the comparison with M6 demonstrates that the number of SNPs varies from 3.8 million in the diploid PHU up to 8.8 million in the pentaploid CUR. The largest number of SNPs identified compared to M6 is 5.6 million in the diploids (in BUK), 8.6 million in the triploids (in JUZ) and 7.9 million in the tetraploids (in ADG1). In summary, the number of SNPs varies between 3.8 million and 10.5 million when compared with DM1-3 and between 3.8 million and 8.6 million when compared with M6 (Table 1).
A total of 96,690 and 373,932 small polymorphisms (SNPs and indels) are shared among the diploids and among the polyploids of the 12-genome panel, respectively, while 32,959 are shared across all ploidy levels. Of these, about 65% lie in the conserved genome, which is not impacted by any CNVs, and the remainder lie in the CNV-impacted genome.
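Counting polymorphisms "in common" amounts to intersecting the per-genome variant sets, first within each ploidy group and then across all genomes. The following is a minimal sketch under stated assumptions: variants are keyed by (chromosome, position, reference allele, alternate allele), the genome labels are generic placeholders, and the toy calls are invented for illustration rather than drawn from the data of this study.

```python
def shared_variants(calls_by_genome):
    """Return the variants present in every genome of the mapping.

    calls_by_genome: dict of genome name -> set of (chrom, pos, ref, alt)
    """
    sets = list(calls_by_genome.values())
    return set.intersection(*sets) if sets else set()


# Toy variant keys (hypothetical).
a = ("chr01", 1000, "A", "G")
b = ("chr01", 2000, "C", "T")
c = ("chr04", 3000, "G", "GA")
d = ("chr09", 4000, "T", "C")
e = ("chr11", 5000, "AT", "A")

diploids = {"diploid_1": {a, b, c}, "diploid_2": {a, b, d}}
polyploids = {"polyploid_1": {a, c, e}, "polyploid_2": {a, d, e}}

shared_diploid = shared_variants(diploids)
shared_polyploid = shared_variants(polyploids)
shared_all = shared_variants({**diploids, **polyploids})

print(len(shared_diploid), len(shared_polyploid), len(shared_all))
```

Because the set shared across all ploidy levels must be contained in both group-level sets, it can never be larger than either of them, which is consistent with the counts reported above.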
The identified SNPs were annotated with snpEff (Cingolani et al. 2012), and Fig. 2 shows the total number of small variants (SNPs and indels) in the intergenic, exonic and intronic regions. In both reference genome comparisons, the largest share of the SNPs is found in the intergenic regions, representing 44% of the SNPs (about 22% upstream and 22% downstream). About 51% and 48% of the SNPs consist of missense and silent mutations, respectively, and about 2% are nonsense mutations. The number of indels is smaller than the number of SNPs, with more small deletions than small insertions in both comparisons.
To identify the most heterozygous regions, biallelic loci were identified in the diploid genomes. Sites that had one or more alternate alleles compared to the reference genome were counted as heterozygous sites. Heterozygosity is not spread evenly over the genomes, and some chromosomes are more heterozygous than others based on alternate allele frequency (Supplementary Table 1). The most heterozygous regions in the M6 genome compared to DM1-3 were previously reported on chromosomes 4, 8 and 9 (Leisner et al. 2018), and the same chromosomes were found in our analysis. This confirms the validity of the pipeline used in the present study (assaying a total of 589 Mb in contrast to the 298 Mb that was previously used). When the landrace genomes are compared to DM1-3, most heterozygous regions are found on chromosome 1 (an average of ~11% heterozygous SNPs; not in M6) and chromosome 4 (an average of ~10% heterozygous SNPs), although some genomes also contained heterozygous regions on chromosomes 3, 6, 8, 9, 10 and 12 (Supplementary Table 1). Specifically, GON1, GON2 and PHU are highly heterozygous on chromosome 9, and AJH and M6 on chromosome 4. Chromosome 1 was the most heterozygous for the polyploids.
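One possible way to rank chromosomes by heterozygosity is to compute, for a diploid sample, the share of all heterozygous sites that falls on each chromosome. The sketch below rests on assumptions that are not stated in the text: a site is treated as heterozygous when its two called alleles differ (the conventional diploid definition), allele 0 denotes the reference allele, and the function name and toy genotypes are hypothetical.

```python
from collections import Counter


def het_share_by_chromosome(genotypes):
    """Percentage of all heterozygous sites located on each chromosome.

    genotypes: iterable of (chrom, allele_a, allele_b) for one diploid sample,
               where 0 is the reference allele and 1 an alternate allele.
    """
    # Assumption: a site is heterozygous when the two called alleles differ.
    het = Counter(chrom for chrom, a, b in genotypes if a != b)
    total = sum(het.values())
    if total == 0:
        return {}
    return {chrom: 100.0 * n / total for chrom, n in het.items()}


# Toy genotype calls (hypothetical).
calls = [
    ("chr01", 0, 1),
    ("chr01", 0, 1),
    ("chr01", 1, 1),  # homozygous alternate, not counted
    ("chr04", 0, 1),
    ("chr09", 0, 0),  # homozygous reference, not counted
    ("chr09", 0, 1),
]
shares = het_share_by_chromosome(calls)
print(shares)
```

With 12 chromosomes, an even spread would put roughly 8% of heterozygous sites on each one, so shares noticeably above that level point to chromosomes enriched in heterozygous sites.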
The same approach was also used for the identification of the highly heterozygous regions in the genomes compared to the M6 genome. Chromosomes 1 and 12 are consistently the most heterozygous for all the genomes regardless of ploidy level (Supplementary Table 1). Additionally, GON1, GON2, PHU and CHA are highly heterozygous in chromosome 6, while AJH, ADG1, TBR and CUR in chromosome 5, BUK and JUZ in chromosome 3, STN and COM in chromosome 11 and, finally, ADG2 in chromosome 7 (Supplementary Table 1). The highly heterozygous SNPs (compared to both reference genomes) are found predominantly in the intergenic regions based on the annotation by snpEff (Cingolani et al. 2012).
The majority of the SNPs identified across both the diploid and polyploid genomes against both reference genomes are biallelic, with the largest proportion in the ADG1 and CUR genomes (98%). Moreover, most of the biallelic SNPs are of type B (biallelic sites with at least one reference allele and at least one alternate allele). Type B constitutes up to 97% of the biallelic sites in the ADG1 and CUR genomes.
Distribution of structural variations in the landrace genomes compared to the DM1-3 and M6 references shows both polymorphism and synergy
Size of the CNVs detected
The length of the CNVs detected in the genomes varies in size compared to both DM1-3 and M6 reference genomes. However, in general, when compared to the M6 genome, the CNVs are larger than those detected against the DM1-3 genome. For the DM1-3, the average median size of the CNVs in the panel for the diploids is 6.4 kb, slightly larger in the polyploids (7.7 kb), and for all genomes (all ploidy levels) the median CNV size is 7 kb (Supplementary Table 2). The comparison against the M6 follows a similar pattern, although the size of the CNVs is much larger with an average median CNV length 12.5 kb and 13.5 kb for the diploids and polyploids, respectively (Supplementary Table 3).
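The "average median" reported here can be read as a two-step summary: the median CNV length is taken within each genome, and those medians are then averaged across the genomes of a ploidy group. A minimal sketch of that reading follows; the function name and the toy lengths are hypothetical and do not reproduce the values in the Supplementary Tables.

```python
from statistics import mean, median


def average_median_cnv_size(cnv_lengths_by_genome):
    """Mean, across genomes, of the per-genome median CNV length (in bp)."""
    per_genome_medians = [median(lengths) for lengths in cnv_lengths_by_genome.values()]
    return mean(per_genome_medians)


# Toy CNV lengths in bp (hypothetical).
toy_lengths = {
    "genome_A": [2000, 6000, 10000],         # median 6000
    "genome_B": [4000, 8000, 9000, 50000],   # median 8500
}
avg_median = average_median_cnv_size(toy_lengths)
print(avg_median)
```

Using the median within each genome keeps a few very large CNVs, such as the 50 kb event in the toy data, from dominating the summary.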
Duplications are generally larger than deletions for both diploids and polyploids compared against both reference genomes. Nevertheless, the largest individual CNVs detected in the genomes compared to DM1-3 are deletions (Supplementary Table 2). In contrast, when the genomes are compared to M6, the largest CNVs detected are duplications (Supplementary Table 3).
Significant gene CNV clusters compared to DM1-3 and M6 reference genomes
To investigate whether large gene clusters were affected by CNVs, the reference genome was split into overlapping bins of 200 kb with a step size of 10 kb, as per Hardigan et al. (2016). The top three CNV bins identified per genome (Supplementary Table 4, Supplementary Table 5) differ among genomes. They involve both duplications and deletions and generally affect disease resistance genes, including those coding for the nucleotide binding site leucine-rich repeat (NBS-LRR) disease resistance proteins. Other CNV-enriched loci contained genes coding for auxin-induced SAURs (small auxin-up RNA), endo-1,4-β-mannosidase and genes of unknown function.
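The binning scheme described above (200 kb windows advanced in 10 kb steps) can be sketched as follows. Only the window size and step come from the text; the function names, the use of gene midpoints to assign genes to bins, and the ranking by raw gene count are assumptions made for illustration, and no statistical enrichment test is implemented here.

```python
def sliding_bins(chrom_length, bin_size=200_000, step=10_000):
    """Yield half-open (start, end) windows that tile a chromosome with overlap."""
    if chrom_length <= 0:
        return
    start = 0
    while True:
        end = min(start + bin_size, chrom_length)
        yield start, end
        if end == chrom_length:  # the last window reaches the chromosome end
            break
        start += step


def top_cnv_bins(chrom_length, cnv_gene_midpoints, n_top=3,
                 bin_size=200_000, step=10_000):
    """Rank bins by the number of CNV-impacted genes whose midpoint they contain."""
    scored = []
    for start, end in sliding_bins(chrom_length, bin_size, step):
        n_genes = sum(start <= pos < end for pos in cnv_gene_midpoints)
        scored.append((n_genes, start, end))
    # Highest gene count first; ties are broken by the leftmost bin.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:n_top]


# Toy chromosome of 250 kb with five CNV-impacted genes (hypothetical midpoints).
bins = list(sliding_bins(250_000))
top = top_cnv_bins(250_000, [5_000, 100_000, 205_000, 215_000, 220_000])
print(len(bins), top)
```

Because consecutive windows share 190 kb, neighbouring bins usually report the same cluster, so in practice the top-ranked bins would need to be merged or de-duplicated before interpretation.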
Significant gene CNV clusters in the diploids compared to DM1-3
When compared to the DM1-3 reference genome, the CNV-impacted regions in common between the diploid genomes were mostly impacted by deletions (Supplementary Table 6). Genes coding for proteins of unknown function were found across the regions impacted in common by CNVs. Deletions on chromosome 1 affect genes such as those coding for the methylketone synthase enzyme, which is involved in the biosynthesis of methylketones, compounds produced by the trichome glands of wild tomato species as a defense against various herbivorous insects (Williams et al. 1980; Antonious 2001; Fridman et al. 2005). Additionally, disease resistance genes impacted by deletions are found on chromosomes 4 and 11 (Supplementary Table 6). The region on chromosome 4 contains the R2 gene, responsible for resistance against the pathogen Phytophthora infestans (Gebhardt and Valkonen 2001). A cluster of genes coding for nucleotide binding site leucine-rich repeat (NBS-LRR) disease resistance proteins, along with others coding for Tobacco mosaic virus (TMV) protein, is impacted by deletions on chromosome 11 (Supplementary Table 6). Finally, genes responsible for biotic and abiotic tolerance are impacted by deletions on chromosomes 9 and 12 (Supplementary Table 6). Some of these genes code for UDP-glycosyltransferases, which glycosylate phytohormones and metabolites in response to biotic and abiotic stresses (Rehman et al. 2018). For instance, they have been shown to play a significant role during TMV infection (Chong et al. 2002; Le Roy et al. 2016) and in resistance against Potato Virus Y (PVY) in tobacco (Matros and Mock 2004). On chromosome 12, deletions impact genes coding for important immunity proteins, such as ubiquitin-conjugating enzyme, RNf5, fiber protein Fb34 and others.
Significant gene CNV clusters in the diploids compared to M6
Similar to the results from the comparison of the diploid genomes to DM1-3, the chromosomes with CNV-impacted genes in common between all the diploid genomes compared against the M6 genome are chromosomes 1, 4, 9 and 11 (Supplementary Table 6). The majority of these genes are impacted by duplications rather than deletions. Genes involved in stress tolerance are duplicated on chromosomes 1, 4 and 9 (Supplementary Table 6). A gene coding for a major facilitator superfamily (MFS) protein is duplicated in all the diploids when compared to the M6 reference. In Arabidopsis, this protein is responsible for drought tolerance (Remy et al. 2013). Similarly, DNAJ genes that were previously found to enhance heat tolerance in transgenic tomatoes (Wang et al. 2019) are duplicated in the diploids, suggesting possible abiotic stress tolerance. In pepper, these genes are involved in growth and development and are induced by heat stress (Fan et al. 2017). Moreover, genes coding for pentatricopeptide repeat (PPR) proteins are duplicated in the diploid genomes. These were previously shown to have various functions in petunia, including restoring fertility to cytoplasmic male sterility (CMS) lines (Bentolila et al. 2002), and in Arabidopsis, they are involved in salt and drought stress tolerance (Zhu et al. 2012; Lv et al. 2014; Zhu et al. 2014). Duplications in genes coding for serine protease inhibitor (SERPIN) may indicate a defense against insect pests (Jamal et al. 2013). Finally, genes coding for various plant metabolic functions, like 2-oxoglutarate/Fe(II)-dependent oxygenase proteins (2OGDs) (Kawai et al. 2014) and others involved in auxin signaling (SAUR genes) (Ren and Gray 2015), are duplicated in the diploids compared to M6 (Supplementary Table 6).
Significant gene CNV clusters in the polyploids compared to DM1-3
The top CNV-enriched gene clusters in the polyploids also included genes coding for SAURs as well as clusters of genes for tolerance to abiotic stress (Supplementary Table 5). Significant CNV gene clusters in common between the polyploid genomes against the DM1-3 genome were identified (Supplementary Table 7). Interestingly, significant CNV gene clusters in common between the tetraploid genomes were found only on chromosomes 1 and 9 (Supplementary Table 7). In the tetraploid genomes, the regions on chromosome 1 coding for S2 self-incompatibility locus 3.2 protein and F-box protein are duplicated. In addition, on chromosome 1 in all the polyploid genomes, genes coding for male sterility proteins are impacted by duplications compared to DM1-3 (Supplementary Table). Genes coding for heat-shock protein, verticillium wilt resistance protein and TMV resistance protein are also duplicated in the polyploids.
Significant gene CNV clusters in the polyploids compared to M6
When compared to the M6 reference genome, all polyploid genomes (ADG1, ADG2, TBR, JUZ, CHA and CUR) have significant CNV-impacted gene clusters on various chromosomes (Supplementary Table 5). All regions have more genes impacted by duplications than by deletions. The significant CNV gene clusters shared by the polyploids when compared to the M6 reference genome include SAUR genes (impacted by duplications on both chromosome 1 and chromosome 11), genes coding for terpene synthases and for C2H2 and C2HC zinc finger proteins, as well as tetraspanins involved in disease resistance. Genes involved in vegetative growth and development, such as gibberellin 3-oxidase genes, are also impacted by duplications, as are genes involved in metabolic processes and response to stimulus (Supplementary Table 7).
Significant gene CNV clusters in all the landrace genomes compared to DM1-3
With the exception of the triploid JUZ, all of the genomes, regardless of ploidy level, have a significantly enriched CNV-impacted gene cluster in the 4.6–4.8-Mb region of chromosome 4 compared to DM1-3 (Supplementary Figure 2). This region contains a disease resistance gene cluster that includes genes coding for the R2 late blight resistance protein, which is implicated in resistance to Phytophthora infestans (Gebhardt and Valkonen 2001). Genes coding for other proteins such as EDNR2GH4, EDNR2GH5, EDNR2GH8 and SNKR2GH2 (which are leucine-rich repeat-containing proteins) are also detected. In the majority of the genomes, the genes in this region are affected by deletions; the exceptions are the BUK and M6 genomes, in which the majority of these genes are affected by duplication events.
Significant gene CNV clusters in all the landrace genomes compared to M6
Significantly CNV-enriched gene clusters are detected across all the genomes compared to M6 on chromosome 1 (64.64–64.82 Mb), chromosome 9 (29.23–29.46 Mb) and chromosome 11 (0.88–1.11 Mb) (Supplementary Figure 3). Two of the three regions (those on chromosomes 1 and 11; Supplementary Figure 3A, 3C) contain SAUR gene clusters. The region on chromosome 9 contains 30 genes coding for members of the 2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily (Supplementary Figure 3B). All the genomes have at least 21 of these genes duplicated, with almost all of them (29) being duplicated in the pentaploid CUR genome.
CNV-based classification of 14 potato genomes
To investigate whether the CNVs have an actual impact on the distance or relatedness of the panel of 12 genomes, M6 and COM, a principal component analysis was performed using the CNV status (duplicated, deleted or non-affected) of the genes. Figure 3 shows three clusters and two outliers: ADG1, ADG2, PHU, GON1, GON2, STN and CHA cluster close together; M6 and TBR form one cluster; and AJH, CUR and JUZ, the bitter potatoes, cluster together, while the two wild species, COM and BUK, are outliers on opposite sides of the graph. Since this largely reflects current taxonomic views, and since a SNP-based phylogenetic analysis was not trivial (because of ploidy and heterozygosity), a phylogenetic analysis was performed with the same CNV-affected gene data as used for the PCA. Figure 4C shows the CNV status-based phylogenetic tree constructed with discrete characters indicating the three statuses of the genes (copy number deleted, duplicated and not impacted). As with the PCA, the GON, PHU, STN and ADG genomes cluster together, with CHA close by. BUK and COM are the outliers, yet it is interesting that they map between the bitter genomes (AJH, JUZ, CUR) and the other cultivated taxa.
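The idea of comparing genomes through a three-state CNV character per gene can be illustrated with a simple pairwise mismatch distance. This is a stand-in chosen for brevity, not the PCA or the tree-building method used in this study; the single-letter status coding ("N" not impacted, "D" deleted, "U" duplicated), the function name and the toy genomes are all hypothetical.

```python
from itertools import combinations


def cnv_status_distances(status_by_genome):
    """Fraction of genes whose CNV status differs, for every pair of genomes.

    status_by_genome: dict of genome name -> sequence of per-gene statuses,
                      all listed in the same gene order.
    """
    distances = {}
    for name_a, name_b in combinations(sorted(status_by_genome), 2):
        status_a = status_by_genome[name_a]
        status_b = status_by_genome[name_b]
        if len(status_a) != len(status_b):
            raise ValueError("all genomes must be scored on the same genes")
        mismatches = sum(x != y for x, y in zip(status_a, status_b))
        distances[(name_a, name_b)] = mismatches / len(status_a)
    return distances


# Toy status vectors over four genes (hypothetical).
toy_status = {
    "genome_A": "NNDU",
    "genome_B": "NNDD",
    "genome_C": "UUDN",
}
dist = cnv_status_distances(toy_status)
print(dist)
```

A matrix of such distances could be passed to an ordination or clustering method; genomes with similar CNV profiles, like genome_A and genome_B in the toy data, end up close together.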