Abstract
Although originally primarily a system for functional biology, Arabidopsis thaliana has, owing to its broad geographical distribution and adaptation to diverse environments, developed into a powerful model in population genomics. Here we present chromosome-level genome assemblies of 69 accessions from a global species range. We found that genomic colinearity is very conserved, even among geographically and genetically distant accessions. Along chromosome arms, megabase-scale rearrangements are rare and typically present only in a single accession. This indicates that the karyotype is quasi-fixed and that rearrangements in chromosome arms are counter-selected. Centromeric regions display higher structural dynamics, and divergences in core centromeres account for most of the genome size variations. Pan-genome analyses uncovered 32,986 distinct gene families, 60% being present in all accessions and 40% appearing to be dispensable, including 18% private to a single accession, indicating unexplored genic diversity. These 69 new Arabidopsis thaliana genome assemblies will empower future genetic research.
Similar content being viewed by others
Main
Genome rearrangements can dramatically impact genetic diversity, phenotypes1,2, recombination3,4,5,6,7,8 and thus local adaptation and evolution9,10,11,12. The whole-genome alignment of complete genome assemblies, which can be achieved using long-read Oxford Nanopore Technologies (ONT) and PacBio high-fidelity (HiFi) technologies, facilitates the identification of large and complex structural variants (SVs)13,14. Pan-genomes, which aggregate multiple genomes covering the diversity of a given species, provide greater insights into the overall genetic diversity compared to using a single reference and have allowed researchers to determine the natural phenotypic variation of a species15,16,17,18, as recently shown in plant and animal species, including soybean19, tomato20,21, potato22, rice23,24,25, maize26, barley27, wheat28, apple29, silkworm12 and human30,31.
The first genome sequence of Arabidopsis thaliana (Col-0) was released in 2000 (ref. 32) and has greatly boosted plant biology and breeding research. This assembly was based on the sequencing of bacterial artificial chromosomes using Sanger technology32 and (with some updates) has served as a reference genome until today. Based on this reference genome, several studies applying whole-genome resequencing arrays or short-read sequencing on thousands of worldwide natural accessions of A. thaliana (in particular, Africa, Eurasia and North America) have unraveled natural genomic variations, including single-nucleotide polymorphisms (SNPs), small indels and SVs. This, in turn, revealed the evolutionary history, divergence and adaptation of A. thaliana at the macro- and micro-evolutionary scales33,34,35,36,37,38,39,40. However, only a limited number of large and complex SVs were reported in Arabidopsis, including the well-characterized ~1.2 Mb inversion on chromosome 4 between Col-0 and Ler5,8,41 and a ~2.5 Mb inversion on chromosome 3 between Col-0 and Sha3, which was captured through chromosome-level assemblies of a few A. thaliana accessions3,5,42,43,44,45. A first Arabidopsis pan-genome analysis used the whole-genome assembly of seven accessions based on PacBio CLR, providing a first glimpse into the diversity of this species3. Recently, a study using PacBio HiFi to assemble the genomes of 32 accessions showed that SVs can play an important role in local adaptation46.
In this Article, we de novo assembled 69 reference-quality Arabidopsis genomes either using PacBio HiFi or Oxford Nanopore long-read technologies, including accessions from Europe, the Middle East, Asia, Africa, Madeira and North America. With such a geographic spread, we aim to capture and describe most of the diversity in Arabidopsis genomes worldwide and constitute a comprehensive resource for future studies bridging phenotypes and genotypes.
Results
Genomic relationship of 72 A. thaliana accessions
We selected 72 Arabidopsis accessions across the global species distribution (Fig. 1a and Supplementary Table 1) and examined their genetic diversity and relationships through SNP analysis. Principal component, phylogenetic tree and admixture analyses showed that global genetic relationships between the 72 genomes broadly mirrored their geographic origins, as previously shown in this species35 (Fig. 1a,b). We identified four major genetic groups and named them after the geographic origin of the majority of their members: ‘Europe’ (35 accessions, including 4 accessions from recently colonized North America47), ‘Africa’ (13 accessions, including accessions from the Mediterranean rim), ‘Madeira’ (5 accessions exclusively from Madeira) and ‘Asia’ (16 accessions distributed from Eastern Europe to Japan). In addition, we identified three accessions (Nemrut-1, Dog-4 both from Turkey and Can-0 from Spain) that did not cluster in distinct groups but were labeled ‘admixed’ and originated from geographic regions between the distinct groups (Fig. 1a–c and Extended Data Fig. 1). We also found that Tsu-0, which is labeled as an accession from Japan, clustered together with the European accessions. This is consistent with previous reports48,49 suggesting that Tsu-0 was mislabeled, and we treated it as belonging to the ‘Europe’ genetic group from thereon.
To infer the evolutionary relationships among A. thaliana accessions, we constructed a species tree of the accessions (see below for further detail in pan-genome analysis, Supplementary Fig. 1), which confirmed their evolutionary relationships and was consistent with previous reports highlighting the African populations as the most divergent and probably most ancient lineages36.
Chromosome-level assemblies of 69 A. thaliana accessions
We generated genome assemblies for each of the 72 accessions using a combination of long-read (48 accessions with PacBio HiFi with a mean depth of 45×, and 24 accessions with Oxford Nanopore with a mean depth of 67×) and short-read sequencing, reference-guided scaffolding and manual curation (Methods). The quality of the assemblies was analyzed in six different aspects across the assemblies (Supplementary Tables 2–7, Extended Data Fig. 2 and Supplementary Figs. 2 and 3). Sixty-nine accessions were confirmed to be inbred lines, but Lu-1, Pa-1 and Istisu-1 showed signs of heterozygosity and thus were removed from the subsequent analysis (Methods). The remaining contig assemblies featured N50 values from 6.1 to 21.3 Mb with a mean of 13.3 Mb, and were scaffolded to the chromosome level (Fig. 1d and Supplementary Table 4).
The assembly sizes of the 69 accessions ranged from 128 to 148 Mb, with an average length of 135 Mb (Fig. 2a). Previous estimates of genome size variation in A. thaliana using flow cytometry were generally higher, ranging from 161 Mb to 184 Mb (ref. 50) (180 lines from Sweden, and the estimation for Col-0 was 166 Mb). However, genome size estimations based on flow cytometry are known to suffer from overestimation51. More recent estimates derived from k-mer analyses based on short read resequencing data of 89 accessions, indicated a range from 138 Mb to 175 Mb (ref. 51). This variation in genome sizes was found to be largely determined by 45S ribosomal DNA (rDNA) copy number variation50. We also used k-mers to estimate the genome sizes of our 69 accessions, and compared them to their assembly sizes (Fig. 2a). Some of the ONT read-based assemblies were substantially shorter than their estimated genome sizes, as their rDNA arrays and centromeres were not fully assembled (Fig. 2a and Extended Data Fig. 2).
With the aim of deciphering the underlying genomic features contributing to the variation in genome size, we selected the 46 most complete assemblies (42 accessions assembled with HiFi, and 4 accessions, Bur-0, Ge-0, Jea and Nok-1, assembled with ONT) based on both (i) the ratio of assembly and k-mer-based genome size estimation and (ii) the ratio of centromere repeat length and read coverage-based centromere size estimation (Fig. 2a and Extended Data Fig. 2). The assembly sizes of these accessions ranged from 130 to 148 Mb. Their centromeric repeat arrays were on average 14 Mb long (across all five chromosomes) ranging from 10 to 22 Mb and were highly correlated with assembly sizes (Pearson’s correlation r = 0.93, P < 2.2 × 10−16, Fig. 2b, Extended Data Fig. 2 and Supplementary Table 7). This showed that the variation of the size of the centromeric arrays in Arabidopsis, as recently described by Wlodzimierz et al.52, is a major contributor to the variation of genome size. Transposable elements (TEs) were annotated with a pan-TE library generated from initial TE annotations of the individual assemblies. The size of TE space (that is, genomic regions with similarity to TEs) was surprisingly similar between the genomes, with a mean length of 16.1 Mb, ranging from 15.2 to 17.6 Mb (Fig. 2d,e). Among them, long terminal repeat (LTR) retrotransposons (Copia, Gypsy and LTR unknown) and Helitrons made up the largest TE fractions and constituted 6.4% and 3.5% of the genome, respectively (Fig. 2d). Accordingly, the variation in the size of TE space between the accessions was moderately correlated with the total assembly size (Pearson’s correlation r = 0.42, P = 0.003, Fig. 2e). This suggests that genome size variation in Arabidopsis (excluding the variation of 45S rDNA size) is mostly dominated by centromeric repeat length and that TEs are only minor contributors. This is in sharp contrast to the situation between plant species, where the main determinants of genome size variation are ploidy levels and TE content53. In species with high TE content, such as rice23, TE space variation can contribute more largely to intraspecific variation in genome size. Interestingly, however, even though cumulative centromere size determines genome size in A. thaliana, the sizes of individual chromosomes were only weakly or not correlated in size (Pearson’s correlation r = 0.2, P = 0.171, Fig. 2c and Supplementary Figs. 3 and 4), indicating that the sizes of individual centromeres evolve independently from each other.
A quasi-fixed karyotype across the A. thaliana species range
Chromosome-level genome assemblies allow accurate analysis of large-scale genomic rearrangements and genome colinearity13. Using pairwise whole-genome alignments, we found that the chromosome arms hardly contained any major rearrangements and were highly syntenic across all genomes, even when comparing genomes from distant parts of the world (Fig. 3, Supplementary Figs. 5–9 and Extended Data Fig. 3). Large insertion/deletion polymorphisms are absent from chromosome arms, the vast majority being smaller than 20 kb and the largest reaching ~55 kb (Extended Data Fig. 3). Inversions along chromosome arms are also rare, but are larger than insertions/deletions, with a few cases above one megabase. We identified a total of seven inversions on chromosome arms, almost all present in single accessions (Fig. 3, Supplementary Figs. 5–9 and Extended Data Fig. 3): a ~2.4 Mb inversion on chromosome 3 in Shahdara, a ~2.3 Mb inversion on chromosome 5 in Zal-1, a ~2.2 Mb inversion on chromosome 1 in N13, a ~1.8 Mb inversion on chromosome 4 in Ws-4, a ~1.2 Mb inversion on chromosome 4 in Stw-0 and a ~1 Mb inversion on chromosome 2 in Ge-0 (validated by long read alignment, Extended Data Fig. 4 and Supplementary Figs. 10–14). A notable exception to this was the well-described ~1.2 Mb inversion on chromosome 4 (refs. 5,41), which was observed in eight accessions (including Col-0) and which partially overlapped with the heterochromatic, pericentromeric regions. This inversion is found in the ‘Europe’ genetic group among geographically distant accessions (for example, Yo-0 from North America and Ws-4 from Belarus), thereby suggesting a long-lived segregation of this particular inversion.
The karyotype of A. thaliana is quite different to the estimated ancestral karyotype of the Brassicaceae, which is still conserved in the sister species Arabidopsis lyrata54,55 and involved a species-specific deletion of three functional centromeres along with major chromosomal rearrangements and fusions56. The high structural similarity between the 69 genomes implied that the derived karyotype of A. thaliana arose during or shortly after speciation and was maintained virtually unchanged during the global spread of this species that colonized contrasted ecological habitats.
In contrast to the chromosome arms, large rearrangements of different types were highly abundant in and near the centromeres and as a result led to numerous different centromeric haplotypes (Fig. 3, Supplementary Figs. 5–9 and Extended Data Figs. 3 and 5).
To quantify colinearity with higher resolution, we measured colinearity in sliding windows along the chromosomes and calculated average pairwise diversity in synteny of each of the newly assembled genomes against a recent genome assembly of Col-0 (ref. 57) (Col-PEK), which includes most of the centromeric regions (Fig. 4a). Average pairwise diversity in synteny ranges from 0 to 1, with 1 denoting complete absence of colinearity between the genome of a group and 0 indicating colinearity among all the genomes. Overall, for the 69 genomes, around 50% of the genome was highly colinear, with an average pairwise diversity in synteny lower than 0.2, which was exclusive to the chromosome arms (Fig. 4a and Supplementary Fig. 15). In contrast, ~33% of the genome was highly diverse with an average pairwise diversity in synteny of over 0.5, mostly including regions in and near the centromeres (Fig. 4a and Supplementary Fig. 15). We observed transitions between regions with very high and very low synteny in the peri-centromeres, which covered several Mb around each of the centromeres.
The broad colinearity in the chromosome arms showed a few interesting exceptions. The short arms of chromosomes 2 and 4, where rDNA clusters and nucleolus organizer regions are located, showed high levels of rearrangements, like the lower arms of chromosome 1 (~25 Mb) and 5 (~20 Mb) where large and highly diverse resistance R gene clusters reside (Fig. 4a)58. Also, the ~1.2 Mb inversion on chromosome 4 was marked by high diversity in synteny consistent with the fact that it segregates in several groups (Fig. 4a). In addition, we found individual spikes of high structural diversity in regions that were otherwise highly colinear. As previously described, such local hotspots of rearrangements were enriched for R gene clusters3. For example, at ~30 Mb on chromosome 1, synteny was broken by the high and variable copy number changes in a nucleotide-binding and leucine-rich repeat (NLR) gene cluster (Fig. 4b).
In addition, pairwise genome-wide colinear relationships reflected the genetic and geographical groups (Fig. 4c–e and Supplementary Fig. 16), implying that structural differences can recapitulate our genetic grouping based on SNPs (Extended Data Fig. 1 and Supplementary Fig. 17), which was also recently shown with read alignment-based SV calls38. Small subgroups of accessions exhibited increased colinearity. These corresponded to geographical clusters, including the Japanese accessions (Hiroshima and Ishikawa), the North American accessions (Yo-0, RRS10 and Tul-0) and the Madeiran accessions (Are-1, Are-2, Are-6, Are-10 and Rab-R1). Interestingly, we also found that the colinearity among African accessions was lower than the colinearity between European or Asian accessions (Fig. 4c–e), probably reflecting the higher genomic diversity in Africa36.
The pan-genome of A. thaliana
We annotated 27,246 to 28,989 protein-coding genes in each of the 69 assemblies (Fig. 5a), compared to 27,445 genes in the reference sequence59. To unravel the gene repertoire of A. thaliana, we clustered all 1,928,005 genes combined with the gene sets of the reference sequence (Col-0 and Araport11), an additional recent telomere-to-telomere (T2T) Col-0 assembly (Col-PEK) and the reference sequences of the sister species A. lyrata and Capsella rubella, which served as outgroups. In total, we identified 36,991 gene families across the 73 genomes, including 13,328 single-copy gene families that were used for constructing the phylogeny of the A. thaliana accessions (Supplementary Fig. 1).
Excluding the genes from the reference genome, the T2T Col-0 accession, A. lyrata and C. rubella, we found that 32,986 gene families included genes from at least one of the 69 A. thaliana genomes. Of those, 19,721 (60%) were present in all 69 genomes and were defined as core gene families. Among the gene families that appear nonessential, 1,613 (5% of total) were present in 63–69 accessions (>90% of the accessions), defined as softcore; 5,582 (17%) were present in 2–62 genomes, defined as dispensable; and the remaining 6,070 (18%) gene families were present in only one genome, defined as private gene families (Fig. 5b and Extended Data Fig. 6). On average, a single genome consisted of 86.8% core, 6.7% softcore, 6.1% dispensable and 0.4% private genes (Fig. 5a). As expected, the increased size of our collection reduced the number of core gene families as compared to a previous estimation, which was based on eight assemblies only3. However, our core gene family number estimate was similar to one of a recent study where 21,545 gene families were annotated as core gene families among 32 accessions46, indicating that our estimation probably reflects the actual core gene set of A. thaliana. In contrast, however, we found a much higher number of private genes in the genomes (18.4% of all gene families) as compared to the previous study. To understand the origin of the private genes, we explored local sequence homology and structure of all gene families, and found that 6,954 gene families were formed by differences in annotation, where gene models are split or merged differently in association with local polymorphisms (Methods). The split–merge cases represent 2.7%, 11%, 26.2% and 78.9% of the core, softcore, dispensable and private gene families, respectively (Fig. 5b, pastel colors). Even though these gene families could result from errors in the annotation, split–merge cases may also represent differences in the transcriptomes with functional consequences. For all the remaining 1,281 private gene families, no sequence similarity can be detected in the other accessions, suggesting that they represent de novo evolved genes.
We found that the length of the encoded protein sequences of the core genes was longer and more often matched Pfam domains as compared to other types of gene (Mann–Whitney test between core and softcore genes, P < 2.2 × 10−16, Fig. 5c-d). Gene expression across 79 organs and developmental stages (measured in Col-0 (median ≥1 transcript per million (TPM))), revealed a much higher fraction of expressed genes with significantly higher expression levels in the core and softcore gene families as compared to the private genes (even though 30.8% of the Col-0 dispensable genes (488/1,584) and 10% of the Col-0 private genes (3/31) were still expressed) (Mann–Whitney test between core and softcore genes, P < 2.2 × 10−16, Fig. 5e). Moreover, core genes showed significantly lower nonsynonymous/synonymous substitution ratios (Ka/Ks) than accessory genes, suggesting that core genes were more functionally constrained than accessory genes (Mann–Whitney test between core and softcore genes, P < 2.2 × 10−16, Fig. 5f). This is corroborated by the functional categories of the different types of gene family that were analyzed with a Gene Ontology (GO) enrichment analysis. Core genes were enriched for basic and essential biological processes, including metabolic, cellular, developmental processes, reproduction and regulation of biological process (Supplementary Fig. 18 and Supplementary Table 8). Among them, the 27 meiosis-specific genes60 were all highly conserved, and were present in the same copy numbers (26 in one copy, and 1 in two copies) across the 69 accessions (Supplementary Table 9). Accessory genes (softcore, dispensable and private categories, analyzed independently or together) were enriched for biological processes such as cell killing and defense response (Supplementary Fig. 19 and Supplementary Table 10). Altogether, while this suggests that the accessory genes are substantially different from the core genes, it also suggests that a fraction of those have functional features and could contribute to phenotypic diversity and adaptation.
Finally, we measured the sizes of the pan-gene sets of the 69 A. thaliana accessions by subsetting the genomes (Fig. 5g and Supplementary Fig. 20). Even after adding all genomes, the pan-genome gene set did not reach a plateau, indicating that our set of accessions did not capture the entire complement of diverse gene families in this species. The reason for this was the high number of gene families that were only present in one or a few of the accessions (‘dispensable’ and ‘private’ gene families), hinting at an untapped genetic diversity.
Discussion
In this study, we have generated 69 reference-quality genome assemblies, which capture a large degree of the genetic diversity in A. thaliana. These assemblies were generated from accessions selected from Central Africa to Iceland and from North America to Japan, but despite these huge geographical distances, genome structure was highly conserved among plants. The assemblies also revealed a total of 10,420 novel protein-coding gene clusters, which are absent from the reference genome (Col-0 and Araport11) and provide a very powerful resource to study the genetic basis of hitherto undescribed variation. In addition, our collection of genome assemblies contains the parental strains of powerful and publicly available material such as recombinant inbred lines61,62. Genome assemblies will help to unravel the genetic basis of important traits relying on complex structural variations63,64,65,66. Finally, these 69 genomes, together with others, provide a great resource to study the mechanisms of genome dynamics, including recombination. These resources pave the way for further functional genomic investigation.
Methods
Plant material and whole-genome sequencing
We ordered the seeds of 66 accessions from the Versailles Arabidopsis Stock Center at Jean-Pierre Bourgin Institute and received the seeds of the remaining 6 accessions (Elh-2, Ice-1, Rab-R1, Tanz-1, Taz-0 and Zin9) from Angela Hancock (MPIPZ). The accession numbers of seeds from Versailles are presented in Supplementary Table 1. Plants were grown in greenhouses or growth chambers.
For the 48 accessions that were sequenced with PacBio HiFi, high-molecular-weight (HMW) DNA was prepared from a pool of 30–40 4-week-old plants using the NucleoBond HMW DNA kit. The HiFi libraries were prepared using the SMRTbell prep kit 3.0 and the BluePippin cartridge for enriching fragments greater than 9 kb up to 50 kb. Finally, HiFi sequencing was performed on the Sequel IIe platform at the Max Planck Genome-centre Cologne. SMRTlink software (PacBio) was used to demultiplex and extract HiFi datasets. DNA form a single plant leave was used to prepare PCR-free short-read libraries according to the protocol of the NEBNext Ultra II DNA PCR-free Library Prep Kit for Illumina (New England Biolabs). Libraries were then sequenced with 2 × 150 paired-end reads on the NextSeq 2000 platform.
As previously described67, for the 24 accessions that were sequenced with Oxford Nanopore, HMW DNA was extracted from 3-week-old plants according to the protocol described by Russo et al.68. The subsequent library preparation and sequencing were performed at the GeT-PlaGe core facility, INRAE Toulouse. ONT libraries were prepared using the EXP-NBD103 and SQK-LSK109 kits according to the manufacturer’s instructions and using 4 µg of 40 kb-sheared DNA (Megaruptor, Diagenode) as input. Pools of six samples were sequenced on one R9.4.1 flowcell. Between 14 and 20 fmol of library was loaded on each flowcell and sequenced on a PromethION instrument. Illumina libraries were prepared using the Illumina TruSeq Nano DNA HT Library Prep Kit. Libraries were then sequenced with 2 × 150 bp paired-end reads on the Hiseq3000 platform.
De novo genome assembly
Genome heterozygosity and size were estimated using Jellyfish v2.2.6 (ref. 69) and findGSE51.
For the 48 accessions that were sequenced with PacBio HiFi, the initial de novo assembly was performed using three different assembly tools: Canu v2.1.1 (ref. 70), Flye v2.7 (ref. 71) and Hifiasm v0.16.1 (ref. 72). Then, purge_dups v1.2.5 (ref. 73) was used to purge haplotigs and overlaps in the assemblies from Canu and Hifiasm. To improve contiguity, we combined the assemblies derived from the three assemblers using quickmerge v0.3 (ref. 74). First, the assembly with the best quality (longest contiguity measured as N50, assembly size and correctness, centromere coverage and so on) was selected as query, and the assembly with second longest N50 was used as reference to join contigs in the query assembly (Supplementary Table 2). The resulting assembly was further improved by using the third assembly as reference. Then, we used a homology-based scaffolding tool, RaGOO v1.1 (ref. 75), to order and orient contigs on the basis of whole-genome alignments to the Col-CEN genome76. Manual evaluation and correction were performed based on the whole-genome alignment and position of centromeric repeats. Finally, to close the gaps in scaffolds, we ran four rounds of MaSuRCA v4.1.0 (ref. 77) by using the HiFi reads and the three assemblies individually.
For the 24 accessions that were sequenced with Oxford Nanopore, the long reads were filtered for adapters, short (<1 kb) or low-quality reads (mean quality >70) using Porechop v0.2.4 (https://github.com/rrwick/Porechop) and Filtlong v0.2.1 (https://github.com/rrwick/Filtlong). De novo assembly of each genome was initially performed using SMARTdenovo (https://github.com/ruanjue/smartdenovo)78 and Flye. To fix base errors in the initial assemblies, we polished the genome by running three rounds of Racon v1.4.10 (ref. 79) with long reads, and four rounds of NextPolish v1.3.1 (ref. 80) with short reads. Then, assemblies were purged using purge_dups. Similar as for the process described above for HiFi datasets, assemblies were further improved, corrected and scaffolded (Supplementary Table 3). Finally, scaffolds were polished using NextPolish.
Genome assembly evaluation
To evaluate the completeness of each genome assembly, compleasm v0.2.2 (ref. 81) was used with the OrthoDB database brassicales_odb10 (brassicales, 2020-08-05). We evaluated the consensus quality value (QV) and completeness of genome assemblies on the basis of the k-mers spectrum of Illumina whole-genome sequencing reads, using Merqury v1.3 (ref. 82) with default parameters, to estimate the goodness and completeness of the reference protein-coding genes (TAIR10 and Araport11) in each genome assembly. The reference genes were aligned against the genome assemblies using blastn83, and reference genes were considered well assembled in genome assemblies which were aligned with identity ≥80 and coverage ≥0.9. We also used Liftoff v1.6.3 (ref. 84) to ‘lift over’ the reference genes to the 69 genome assemblies, with parameters ‘-copies -sc 0.90 -polish’. Additionally, to evaluate the assembly continuity, LTR Assembly Index (LAI) was calculated for each genome assembly using LTR_retriever v2.9.0 (ref. 85).
Centromeric and telomeric repeats were annotated using Bowtie2 v2.4.4 (ref. 86) (-a–very-sensitive) to search for the consensus sequence of the 178-bp and 7-bp repeat motifs in each genome assembly, respectively. To estimate the copy number and length of centromeric repeats, we compared the sequencing depths obtained from aligning Illumina short reads against the genome assembly and concatenated sequence of four copies of the repeat motif, using Bowtie2 (-k 1), separately. The sequencing depth was calculated using samtools v1.9 (ref. 87) with the parameter setting ‘-Q 1 -d 0 -a’.
We evaluated the assembly quality in the following six aspects, and found that (1) the assembled genome size is comparable to the length of T2T assembly of Col-0 reported by recent studies57,76,88 (Fig. 2a, Supplementary Figs. 3 and 21, and Supplementary Table 4); (2) the completeness estimated by Benchmarking Universal Single-Copy Orthologs was 99.8%, which is comparable to the Col-0 reference and T2T genomes (Supplementary Fig. 2 and Supplementary Table 4); (3) on the basis of the k-mer-based estimation, the assembled genomes showed a mean of 53.4 QV and 98.5% completeness (Extended Data Fig. 2 and Supplementary Table 4); (4) the analysis of completeness of reference protein-coding genes by homology search against the assembled genomes (BLASTP and Liftoff), a mean of 97.3% and 98.4% were successfully assembled (Supplementary Table 5); (5) on the basis of the LAI (mean of 22), which reached the ‘gold standard’ level (LAI >20) (Supplementary Table 6); (6) on the basis of the completeness of centromere repeats (mean of 96%), estimated using the Illumina short-read dataset (Extended Data Fig. 2 and Supplementary Table 7). Three accessions, Lu-1, Pa-1 and Istisu-1, had lower values in the measurement of QV and k-mer completeness. Further alternative allele frequency analysis indicated the presence of heterozygosity, which are unexpected in a pure line, probably due to the mixture of two distinct lineages or segregation of polymorphism in the population. These three accessions were ignored in subsequent analyses (Supplemental Fig. 28). These results suggested that the quality of all 69 genome assemblies was comparable to that achieved by the Col-0 reference and T2T genome assemblies, indicating high continuity and completeness.
Annotation of repetitive elements
To annotate the TEs in the 69 genome assemblies, we first generated a nonredundant TE library for each accession, using the Extensive de novo TE Annotator (EDTA) v2.0.1 (ref. 89) with parameters ‘–overwrite 1–sensitive 1–anno 1–evaluate 1’. Then, all the individual TE libraries were combined to construct a pan-TE library using panEDTA90. RepeatMasker v4.1.1 (http://www.repeatmasker.org) was further employed to re-annotate the repeat regions with parameters ‘-q -div 40 -cutoff 225’ and the pan-TE library.
Gene prediction and annotation
Protein-coding genes were annotated on the basis of a strategy that integrated ab initio gene prediction, transcriptome-based de novo transcript assembly and homologous protein sequence alignment. First, four ab initio gene prediction tools were used: Augustus v3.3.3 (ref. 91), GeneMark v4.62 (ref. 92), GlimmerHMM v3.0.4 (ref. 93) and SNAP (version 2006-07-28)94. Second, we collected a list of 308 public RNA sequencing (RNA-seq) datasets for 28 accessions from the NCBI SRA database (Supplementary Table 11). The quality of short reads was checked with FastQC. Trimmomatic v0.39 (ref. 95) was used to remove potential adapter and low-quality sequences, with parameters ‘LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36’ for single-end reads and ‘LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36’ for paired-end reads. To obtain the protein sequence of the transcript, the reads were then processed with Trinity v2.14.0 (ref. 96) to assembly transcript sequences, and TransDecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder) to identify candidate coding regions. The predicted longest open reading frames were searched against the UniPort database97 using DIAMOND v2.0.4 (ref. 98) with the parameter setting ‘-k 1 -f 6 -e 1e-5–ultra-sensitive’, and Pfam database99 using hmmsearch from HMMER v3.1b2 (http://hmmer.org). Then, we merged all the predicted protein sequences to generate the pan-pep library and selected representative sequences using CD-HIT v4.6.8 (refs. 100,101) (-c 0.98). Third, protein sequences of A. thaliana (447_Araport11), A. lyrata (384_v2.1), Oryza sativa (323_v7.0) and Solanum lycopersicum (514_ITAG3.2), which were downloaded from Phytozome v13 database102 and the pan-pep library, were aligned to each genome assembly using Exonerate v2.2.0 (ref. 103) (–percent 70–minintron 10–maxintron 60000). Finally, all different evidences of gene models were integrated using EVidenceModeler v1.1.1 (ref. 104). The resulting gene models, especially for those from nonscaffold contigs, were further evaluated by comparing to the National Center for Biotechnology Information (NCBI) nonredundant (NR) database using DIAMOND, outside which Brassicaceae proteins were excluded.
Noncoding genes were annotated by integrating the predictions from Barrnap v0.9 (https://github.com/tseemann/barrnap), Infernal v1.1.4 (ref. 105) (Rfam database v14.8) and tRNAscan-SE v2.0.9 (ref. 106). Noncoding RNA and TE-related genes were identified by checking alignment/overlap between predicted gene models and TE sequences (TAIR10), representative gene models (Araport11), TE genes (Araport11), TE and noncoding RNA annotations of each assembly.
Disease resistance genes were identified by using NLR-Annotator v2.1 (ref. 107) and RGAugury108. NLRs have been reported to be mostly present in pairs or cluster23,58. Pair NLRs were defined as fewer than two non-NLR genes between the two NLRs. Cluster NLRs were defined as more than two NLRs with fewer than two non-NLR genes between any two NLRs. The remaining NLRs were labeled as singletons.
The resulting gene models were further annotated functionally using InterProScan v5.59-91.0 (ref. 109) (parameters: -f TSV -t p -iprlookup -goterms -pa). The GO enrichment analysis was performed using AgriGO v2.0 (ref. 110).
Gene-based pan-genome construction and analysis
All the protein-coding genes from the 69 assembled genomes, representative protein-coding genes from Col-0 TAIR10 (refs. 32,111), Col-PEK57 and two out-species, A. lyrata (384_v2.1) and C. rubella (474_v1.1)102, were clustered using OrthoFinder v2.5.4 (ref. 112) with the parameter setting ‘-S diamond_ultra_sens’, resulting 37,921 gene clusters for 73 genomes. We used Liftoff to ‘lift over’ all the predicted protein-coding genes (including genes from the Col-0 TAIR10) to each of the 69 genome assemblies, with parameters ‘-copies -sc 0.90 -polish’. The gene locus, that is, low allele frequency, in each genome was checked for the presence of homologous genes (95% coverage for both query and hit genes, and reciprocal best hit) from the other accessions, and then, the related orthogroups were fused. Furthermore, the potential split–merge cases were also evaluated for the alignment and coverage between the query gene and representative gene (across the 69 accessions) or TAIR10 reference genes. After correction, we obtained 36,991 gene clusters for 73 genomes, and 32,986 gene clusters for the 69 assembled genomes. The orthologous groups were then classified into four categories: core gene clusters were defined as the genes shared between all the 69 genomes; softcore gene clusters were present in more than 90% of accessions (63–68); dispensable gene clusters were found in more than one accession (2–62); and private gene clusters that were accession specific. To estimate the pan-genome and core-genome size (the number of gene families defined by OrthoFinder), we carried out 2,000 random samplings of accessions for each number of sample size (ranging from 2 to 67) from the 69 accessions.
Detection of SNPs and indels
The Illumina whole-genome sequencing short-reads of the 72 accessions (with mean depths of 43×) were aligned against the Col-PEK genome by BWA v0.7.15-r1140 with default parameters, and duplicated reads were removed using samtools. Then, SNPs and small indels (ranging from 1 to 20 bp) were detected and filtered for each genome assembly and merged by inGAP-family113. The resulting variants were further processed by VCFtools v0.1.16 (ref. 114) to obtain the high-quality and informative SNP list (parameters: ‘–maf 0.05–max-missing 0.2–min-alleles 2–max-alleles 2–min-meanDP 6–max-meanDP 226’). SNPs that were located in the region of centromeric repeats and TEs were removed. We found a total of 7,056,033 SNPs, among which 2,254,527 were common and located in noncentromeric and non-TE regions, with a density of one SNP per 59 bp.
SV detection and analysis
To fully take advantage of the 69 high-contiguity genome assemblies, we performed the whole-genome alignment against the Col-PEK genome using minimap2 v2.21-r1071 (refs. 115,116) (parameters: -ax asm5–eqx), and SyRI v1.6 (ref. 13) was applied to identify SVs with default parameters. SVs (20 bp to 10 kb) from the 69 genome assemblies were merged by SURVIVOR with the parameters ‘1000 1 1 0 0 1’. SVs longer than 10 kb from 69 accessions detected by SyRI were retained and merged by SURVIVOR (parameters: ‘20000 1 1 0 0 1’).
Population and phylogenetic analysis
For the SNPs in the 72 A. thaliana accessions, linkage pruning was performed by PLINK v1.90b6.18 (ref. 117), with the parameters ‘–indep-pairwise 50 10 0.1’. Then, principal component analysis (PCA) was conducted by PLINK, which showed that PC1 splits Madeira and Africa accessions from the North America–Europe–Asia group, while PC2 further divides Asia accessions from the others (Fig. 1b). We performed population structure inference using ADMIXTURE v1.3.0 (ref. 118), with the number of population settings ranging from 2 to 5. For K = 2, we found a division between Africa and the other accessions. When K = 4, we saw a new subgroup (Madeira) within the Africa group. When K = 5, the subgroup Sicily and Lebanon emerged within the Africa group (Fig. 1c).
To build the phylogenetic tree of the 69 accessions, a total of 13,328 single-copy orthologous gene clusters were selected and used for generating the amino acid alignment using MUSCLE v3.8.31 (ref. 119) with default parameters. For the amino acid alignment of each ortholog group, the nucleotide sequence that corresponds with the amino acid sequence was extracted by seqkit v2.3.0 (ref. 120) with default parameters, and then the coding sequence (CDS) alignment was generated by PAL2NAL v14 (ref. 121) with the parameter ‘-nomismatch’. Then, a concatenated super matrix of the 13,328 ortholog-based CDS alignment was constructed with different partitions defined corresponding to different gene clusters. The super matrix and partition definition were used for building a maximal likelihood tree using IQ-TREE v1.6.12 (ref. 122) (parameters: -m MFP -bb 1000 -alrt 1000 -redo -safe). The SNP list of the 72 accessions was converted into FASTA format using vcf2phylip v2.8 (https://github.com/edgardomortiz/vcf2phylip), and then was taken by IQ-TREE to reconstruct the evolutionary tree (parameters: -m GTR + ASC -bb 1000 -alrt 1000 -redo -safe). The gene presence–absence variation matrix (phylip format) was generated and used for tree building by IQ-TREE (parameters: -st MORPH -m MK + ASC -bb 1000 -alrt 1000 -redo -safe).
To estimate population recombination rates (ρ = 4Ner, where Ne is the effective population size and r is the recombination rate of the window), we used FastEPRR v2.0 (ref. 123) with 100-kb nonoverlapping window size. The nucleotide diversity of each site was calculated by VCFtools.
To calculate the Ka, Ks and Ka/Ks of each orthologous gene pair (A. lyrata as the reference), the amino acid sequences were aligned using MUSCLE, transformed into CDS alignments with PAL2NAL, and then fed into the KaKs_Calculator v2.0 (refs. 124,125).
Gene expression analysis
The RNA-seq dataset from a previous study126, including 79 organs and developmental stages of A. thaliana, was downloaded from the NCBI SRA database. First, the potential adapter and low-quality sequences were identified and removed by Trimmomatic v0.39 (ref. 95), with parameters ‘LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36’. Then, HISAT2 v2.1.0 (refs. 127,128) was used to align the clean reads against the Col-PEK genome. Gene expression was normalized as TPM, which was calculated by StringTie v2.0.6 (ref. 129) with default parameters.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The raw Illumina, PacBio HiFi and Oxford Nanopore sequencing data of the 72 accessions can be accessed in EMBL-ENA under the accession number PRJEB62038. The genome assemblies can be accessed in NCBI under the accession number PRJNA1033522. The data generated in this study can be accessed in Edmond (the Open Research Data Repository of the Max Planck Society, https://doi.org/10.17617/3.AEOJBL) (ref. 130), including genome assemblies (including Lu-1, Pa-1 and Istisu-1), gene and TE annotations, SNPs and SVs, pan-genome matrix and othrogroups. The RNA-seq dataset used in this study is downloaded from the NCBI SRA database (the accession numbers are included in Supplementary Table 11). The databases used in this study, including OrthoDB brassicales_odb10 (brassicales, 2020-08-05), NCBI NR database and Rfam database v14.8, are all public available. Source data are provided with this paper.
Code availability
The related code is available at GitHub (https://github.com/qclian/Pan_Ath) and Zenodo (https://doi.org/10.5281/zenodo.10567419) (ref. 131). All software used in the study are publicly available from the Internet as described in Methods and Reporting Summary.
References
Weischenfeldt, J., Symmons, O., Spitz, F. & Korbel, J. O. Phenotypic impact of genomic structural variation: insights from and for human disease. Nat. Rev. Genet. 14, 125–138 (2013).
Alonge, M. et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell 182, 145–161 e23 (2020).
Jiao, W. B. & Schneeberger, K. Chromosome-level assemblies of multiple Arabidopsis genomes reveal hotspots of rearrangements with altered evolutionary dynamics. Nat. Commun. 11, 989 (2020).
Lian, Q. et al. The megabase-scale crossover landscape is largely independent of sequence divergence. Nat. Commun. 13, 3828 (2022).
Zapata, L. et al. Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms. Proc. Natl Acad. Sci. USA 113, E4052–E4060 (2016).
Capilla-Perez, L. et al. The synaptonemal complex imposes crossover interference and heterochiasmy in Arabidopsis. Proc. Natl Acad. Sci. USA 118, e2023613118 (2021).
Durand, S. et al. Joint control of meiotic crossover patterning by the synaptonemal complex and HEI10 dosage. Nat. Commun. 13, 5999 (2022).
Schmidt, C. et al. Changing local recombination patterns in Arabidopsis by CRISPR/Cas mediated chromosome engineering. Nat. Commun. 11, 4418 (2020).
Lowry, D. B. & Willis, J. H. A widespread chromosomal inversion polymorphism contributes to a major life-history transition, local adaptation, and reproductive isolation. PLoS Biol. 8, e1000500 (2010).
Lamichhaney, S. et al. Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax). Nat. Genet. 48, 84–88 (2016).
Harringmeyer, O. S. & Hoekstra, H. E. Chromosomal inversion polymorphisms shape the genomic landscape of deer mice. Nat. Ecol. Evol. 6, 1965–1979 (2022).
Tong, X. et al. High-resolution silkworm pan-genome provides genetic insights into artificial selection and ecological adaptation. Nat. Commun. 13, 5619 (2022).
Goel, M., Sun, H., Jiao, W. B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 20, 277 (2019).
Nattestad, M. & Schatz, M. C. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics 32, 3021–3023 (2016).
Bayer, P. E., Golicz, A. A., Scheben, A., Batley, J. & Edwards, D. Plant pan-genomes are the new reference. Nat. Plants 6, 914–920 (2020).
De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).
Della Coletta, R., Qiu, Y., Ou, S., Hufford, M. B. & Hirsch, C. N. How the pan-genome is changing crop genomics and improvement. Genome Biol. 22, 3 (2021).
Jayakodi, M., Schreiber, M., Stein, N. & Mascher, M. Building pan-genome infrastructures for crop plants and their use in association genetics. DNA Res. 28, dsaa030 (2021).
Liu, Y. et al. Pan-genome of wild and cultivated soybeans. Cell 182, 162–176 e13 (2020).
Gao, L. et al. The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor. Nat. Genet. 51, 1044–1051 (2019).
Zhou, Y. et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature 606, 527–534 (2022).
Tang, D. et al. Genome evolution and diversity of wild and cultivated potatoes. Nature 606, 535–541 (2022).
Shang, L. et al. A super pan-genomic landscape of rice. Cell Res. 32, 878–896 (2022).
Zhang, F. et al. Long-read sequencing of 111 rice genomes reveals significantly larger pan-genomes. Genome Res. 32, 853–863 (2022).
Qin, P. et al. Pan-genome analysis of 33 genetically diverse rice accessions reveals hidden genomic variations. Cell 184, 3542–3558 e16 (2021).
Hufford, M. B. et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 373, 655–662 (2021).
Jayakodi, M. et al. The barley pan-genome reveals the hidden legacy of mutation breeding. Nature 588, 284–289 (2020).
Walkowiak, S. et al. Multiple wheat genomes reveal global variation in modern breeding. Nature 588, 277–283 (2020).
Sun, X. et al. Phased diploid genome assemblies and pan-genomes provide insights into the genetic history of apple domestication. Nat. Genet. 52, 1423–1432 (2020).
Liao, W. W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Vollger, M. R. et al. Increased mutation and gene conversion within human segmental duplications. Nature 617, 325–334 (2023).
Initiative, A. G. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
Cao, J. et al. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat. Genet. 43, 956–963 (2011).
Gan, X. et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477, 419–423 (2011).
The 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).
Durvasula, A. et al. African genomes illuminate the early history and transition to selfing in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 114, 5213–5218 (2017).
Zou, Y. P. et al. Adaptation of Arabidopsis thaliana to the Yangtze River basin. Genome Biol. 18, 239 (2017).
Goktay, M., Fulgione, A. & Hancock, A. M. A new catalog of structural variants in 1,301 A. thaliana lines from Africa, Eurasia, and North America reveals a signature of balancing selection at defense response genes. Mol. Biol. Evol. 38, 1498–1511 (2021).
Horton, M. W. et al. Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel. Nat. Genet. 44, 212–216 (2012).
Frachon, L. et al. Intermediate degrees of synergistic pleiotropy drive adaptive evolution in ecological time. Nat. Ecol. Evol. 1, 1551–1561 (2017).
Fransz, P. et al. Molecular, genetic and evolutionary analysis of a paracentric inversion in Arabidopsis thaliana. Plant J. 88, 159–178 (2016).
Barragan, A. C. et al. A truncated singleton NLR causes hybrid necrosis in Arabidopsis thaliana. Mol. Biol. Evol. 38, 557–574 (2021).
Michael, T. P. et al. High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. Nat. Commun. 9, 541 (2018).
Pucker, B. et al. A chromosome-level sequence assembly reveals the structure of the Arabidopsis thaliana Nd-1 genome and its gene set. PLoS ONE 14, e0216233 (2019).
Rabanal, F. A. et al. Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes. Nucleic Acids Res. 50, 12309–12327 (2022).
Kang, M. et al. The pan-genome and local adaptation of Arabidopsis thaliana. Nat. Commun. 14, 6259 (2023).
Hagmann, J. et al. Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage. PLoS Genet. 11, e1004920 (2015).
Anastasio, A. E. et al. Source verification of mis-identified Arabidopsis thaliana accessions. Plant J. 67, 554–566 (2011).
Simon, M. et al. DNA fingerprinting and new tools for fine-scale discrimination of Arabidopsis thaliana accessions. Plant J. 69, 1094–1101 (2012).
Long, Q. et al. Massive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden. Nat. Genet. 45, 884–890 (2013).
Sun, H., Ding, J., Piednoel, M. & Schneeberger, K. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics 34, 550–557 (2018).
Wlodzimierz, P. et al. Cycles of satellite and transposon evolution in Arabidopsis centromeres. Nature 618, 557–565 (2023).
Willing, E. M. et al. Genome expansion of Arabis alpina linked with retrotransposition and reduced symmetric DNA methylation. Nat. Plants 1, 14023 (2015).
Murat, F. et al. Understanding Brassicaceae evolution through ancestral genome reconstruction. Genome Biol. 16, 262 (2015).
Schranz, M. E., Lysak, M. A. & Mitchell-Olds, T. The ABC’s of comparative genomics in the Brassicaceae: building blocks of crucifer genomes. Trends Plant Sci. 11, 535–542 (2006).
Hu, T. T. et al. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet. 43, 476–481 (2011).
Hou, X., Wang, D., Cheng, Z., Wang, Y. & Jiao, Y. A near-complete assembly of an Arabidopsis thaliana genome. Mol. Plant 15, 1247–1250 (2022).
Van de Weyer, A. L. et al. A species-wide inventory of NLR genes and alleles in Arabidopsis thaliana. Cell 178, 1260–1272 e14 (2019).
Cheng, C. Y. et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 89, 789–804 (2017).
Thangavel, G., Hofstatter, P. G., Mercier, R. & Marques, A. Tracing the evolution of the plant meiotic molecular machinery. Plant Reprod. 36, 73–95 (2023).
Simon, M. et al. Quantitative trait loci mapping in five new large recombinant inbred line populations of Arabidopsis thaliana genotyped with consensus single-nucleotide polymorphism markers. Genetics 178, 2253–2264 (2008).
Loudet, O., Chaillou, S., Camilleri, C., Bouchez, D. & Daniel-Vedele, F. Bay-0 x Shahdara recombinant inbred line population: a powerful tool for the genetic dissection of complex traits in Arabidopsis. Theor. Appl. Genet. 104, 1173–1184 (2002).
Durand, S., Bouche, N., Perez Strand, E., Loudet, O. & Camilleri, C. Rapid establishment of genetic incompatibility through natural epigenetic variation. Curr. Biol. 22, 326–331 (2012).
Bikard, D. et al. Divergent evolution of duplicate genes leads to genetic incompatibilities within A. thaliana. Science 323, 623–626 (2009).
Smith, L. M., Bomblies, K. & Weigel, D. Complex evolutionary events at a tandem cluster of Arabidopsis thaliana genes resulting in a single-locus genetic incompatibility. PLoS Genet. 7, e1002164 (2011).
Demirjian, C. et al. An atypical NLR gene confers bacterial wilt susceptibility in Arabidopsis. Plant Commun. 4, 100607 (2023).
Simon, M. et al. APOK3, a pollen killer antidote in Arabidopsis thaliana. Genetics 221, iyac089 (2022).
Russo, A. et al. Low-input high-molecular-weight DNA extraction for long-read sequencing from plants of diverse families. Front. Plant Sci. 13, 883897 (2022).
Marcais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).
Chakraborty, M., Baldwin-Brown, J. G., Long, A. D. & Emerson, J. J. Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage. Nucleic Acids Res. 44, e147 (2016).
Alonge, M. et al. RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biol. 20, 224 (2019).
Naish, M. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, eabi7489 (2021).
Zimin, A. V. et al. The MaSuRCA genome assembler. Bioinformatics 29, 2669–2677 (2013).
Liu, H., Wu, S., Li, A. & Ruan, J. SMARTdenovo: a de novo assembler using long noisy reads. GigaByte 2021, gigabyte15 (2021).
Vaser, R., Sovic, I., Nagarajan, N. & Sikic, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
Huang, N. & Li, H. compleasm: a faster and more accurate reimplementation of BUSCO. Bioinformatics, 39, btad595 (2023).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2020).
Ou, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46, e126 (2018).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Wang, B. et al. High-quality Arabidopsis thaliana genome assembly with nanopore and HiFi long reads. Genomics Proteom. Bioinform. 20, 4–13 (2022).
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019).
Ou, S. et al. Differences in activity and stability drive transposable element variation in tropical and temperate maize. Preprint at bioRxiv https://doi.org/10.1101/2022.10.09.511471 (2022).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–W439 (2006).
Ter-Hovhannisyan, V., Lomsadze, A., Chernoff, Y. O. & Borodovsky, M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 18, 1979–1990 (2008).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
Korf, I. Gene finding in novel genomes. BMC Bioinform. 5, 59 (2004).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
UniProt, C. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
Finn, R. D. et al. The Pfam protein families database. Nucleic Acids Res. 38, D211–D222 (2010).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Goodstein, D. M. et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 40, D1178–D1186 (2012).
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinform. 6, 31 (2005).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 49, 9077–9096 (2021).
Steuernagel, B. et al. The NLR-Annotator Tool enables annotation of the intracellular immune receptor repertoire. Plant Physiol. 183, 468–482 (2020).
Li, P. et al. RGAugury: a pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants. BMC Genomics 17, 852 (2016).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Tian, T. et al. agriGO v2.0: a GO analysis toolkit for the agricultural community, 2017 update. Nucleic Acids Res. 45, W122–W129 (2017).
Lamesch, P. et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 40, D1202–D1210 (2012).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
Lian, Q., Chen, Y., Chang, F., Fu, Y. & Qi, J. inGAP-family: accurate detection of meiotic recombination loci and causal mutations by filtering out artificial variants due to genome complexities. Genomics Proteom. Bioinform. 20, 524–535 (2022).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19, 1655–1664 (2009).
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).
Suyama, M., Torrents, D. & Bork, P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609–W612 (2006).
Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
Gao, F., Ming, C., Hu, W. & Li, H. New software for the fast estimation of population recombination rates (FastEPRR) in the genomic era. G3 6, 1563–1571 (2016).
Wang, D., Zhang, Y., Zhang, Z., Zhu, J. & Yu, J. KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies. Genomics Proteom. Bioinform. 8, 77–80 (2010).
Zhang, Z. KaKs_Calculator 3.0: calculating selective pressure on coding and non-coding sequences. Genomics Proteom. Bioinform. 20, 536–540 (2022).
Klepikova, A. V., Kasianov, A. S., Gerasimov, E. S., Logacheva, M. D. & Penin, A. A. A high resolution map of the Arabidopsis thaliana developmental transcriptome based on RNA-seq profiling. Plant J. 88, 1058–1070 (2016).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
A pan-genome of 69 Arabidopsis thaliana accessions reveals a conserved genome structure throughout the global species range. Edmond https://doi.org/10.17617/3.AEOJBL (2024).
Lian, Q. The related code for a pan-genome of 69 Arabidopsis thaliana accessions. Zenodo https://doi.org/10.5281/zenodo.10567419 (2024).
Acknowledgements
We thank the Max Planck-Genome-centre Cologne (MP-GC) for DNA extraction, library preparation and sequencing, H. Sun and F. A. Rabanal for helpful discussions and comments, and N. Donnelly for proofreading the manuscript. We thank A. Hancock and The Versailles Arabidopsis Stock Center for providing genetic material. This work was supported by core funding from the Max Planck Society and an Alexander von Humboldt Fellowship to Q.L. This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy—EXC 2048/1–390686111 (to K.S. and R.M.) and TRR 341/1–456082119 to K.S., as well as by the European Research Council (ERC) with the grant ‘INTERACT’ (802629) to K.S.
Funding
Open access funding provided by Max Planck Society.
Author information
Authors and Affiliations
Contributions
Q.L., F.R., K.S. and R.M. designed the research. Q.L., K.S. and R.M. analyzed the data. B.W. and B.M. generated plant materials. B.H., C.-L.R. and L.G. supervised the whole-genome sequencing work. Q.L., K.S. and R.M. wrote the article with input from the other authors.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks Grey Monroe, Kentaro Shimizu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Phylogenetic tree based on SNPs in the 72 A. thaliana genomes.
Tree branches (accessions) are coloured according to the genetic classification. Europe (red), Asia (blue), Madeira (green), Africa (orange) and admixture (purple).
Extended Data Fig. 2 Comparison of the assembly and estimated lengths of genomes and centromeres.
(a) Comparison of the assembly and genome sizes of the 69 accessions. The genome sizes were estimated based on k-mer from Illumina reads. (b) Comparison of the assembly and centromere sizes of the 69 accessions. The centromere sizes were estimated based on sequencing depths of Illumina reads. The estimated sizes are shown by grey bars. The assembly size of each accession are coloured according to its genetic classification. The most complete 46 genomes are marked by stars.
Extended Data Fig. 3 The distribution of length and minor allele frequency of SVs in the 69 genomes.
(a) The chromosomal distribution of SVs ( > 500 kb) in the 69 genomes along chromosomes of Col-PEK. The heatmap indicates the gene density. The centromeres are indicated by grey circles and segments. The colour, shape and size of the individual points represent the size, type and allele frequency of the SVs. (b-k) Length distributions of SVs (deletions, insertions, duplications, inversions and translocations) with size from 20 bp to 10 kb and longer than 10 kb, separately. (l-m) The minor allele distribution of SVs with sizes from 20 bp to 10 kb and longer than 10 kb, separately. The colour and shape of points represent the type of the SVs.
Extended Data Fig. 4 Alignment of long-reads in the breakpoint regions of the inversion in Stw-0.
The top, middle and bottom panels show the alignment of long-reads from Stw-0 against Col-PEK, the closest accession Ct-1, and the Stw-0 assemblies, with a window of 30 kb covering the left and right breakpoints of the detected inversion, respectively.
Extended Data Fig. 5 The distribution of genetic variation in the 72 genomes along chromosomes of Col-PEK.
Distribution map of genetic variation in the 72 genomes along the chromosomes profiled in 100 kb windows. a: Chromosomes, centromeres are marked by grey; b: Gene density; c: SNP density; d: small indel density; e: SV density (69 accessions); f: historical recombination map (4Ner per kb), centromeres masked; g: TE density.
Extended Data Fig. 6 Presence and absence of pan gene families in the 69 A. thaliana genomes.
Each column indicates a non-redundant gene (gene family), and each row indicates an accession. The gene families were grouped by their categories including core, softcore, dispensable and private. Accessions are ordered according to their genetic relationship.
Supplementary information
Supplementary Information
Supplementary Figs. 1–22.
Supplementary Tables
Supplementary Tables 1–11.
Source data
Source Data Fig. 1
Statistical source data.
Source Data Fig. 2
Statistical source data.
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Statistical source data.
Source Data Fig. 5
Statistical source data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lian, Q., Huettel, B., Walkemeier, B. et al. A pan-genome of 69 Arabidopsis thaliana accessions reveals a conserved genome structure throughout the global species range. Nat Genet 56, 982–991 (2024). https://doi.org/10.1038/s41588-024-01715-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41588-024-01715-9
- Springer Nature America, Inc.