Genomic insights from the first chromosome-scale assemblies of oat (Avena spp.) diploid species
Cultivated hexaploid oat (Common oat; Avena sativa) has held a significant place within the global crop community for centuries; although its cultivation has decreased over the past century, its nutritional benefits have garnered increased interest for human consumption. We report the development of fully annotated, chromosome-scale assemblies for the extant progenitor species of the As- and Cp-subgenomes, Avena atlantica and Avena eriantha respectively. The diploid Avena species serve as important genetic resources for improving common oat’s adaptive and food quality characteristics.
The A. atlantica and A. eriantha genome assemblies span 3.69 and 3.78 Gb with an N50 of 513 and 535 Mb, respectively. Annotation of the genomes, using sequenced transcriptomes, identified ~ 50,000 gene models in each species—including 2965 resistance gene analogs across both species. Analysis of these assemblies classified much of each genome as repetitive sequence (~ 83%), including species-specific, centromeric-specific, and telomeric-specific repeats. LTR retrotransposons make up most of the classified elements. Genome-wide syntenic comparisons with other members of the Pooideae revealed orthologous relationships, while comparisons with genetic maps from common oat clarified subgenome origins for each of the 21 hexaploid linkage groups. The utility of the diploid genomes was demonstrated by identifying putative candidate genes for flowering time (HD3A) and crown rust resistance (Pc91). We also investigate the phylogenetic relationships among other A- and C-genome Avena species.
The genomes we report here are the first chromosome-scale assemblies for the tribe Poeae, subtribe Aveninae. Our analyses provide important insight into the evolution and complexity of common hexaploid oat, including subgenome origin, homoeologous relationships, and major intra- and intergenomic rearrangements. They also provide the annotation framework needed to accelerate gene discovery and plant breeding.
Keywords: Aveninae, Avena, Oat, Hi-C, Flowering time, Crown rust resistance, Polyploidy
BLAST: Basic Local Alignment Search Tool
GWAS: Genome-wide association studies
Ma: Million years ago
QTL: Quantitative trait loci
RGA: Resistance gene analog
SNP: Single nucleotide polymorphisms
Oat (Avena sativa L.) is a nutritionally important crop throughout the world. It is ranked 6th in world cereal production , and while its primary use continues to be as a livestock feed, its uses as a human food and for cosmetics continue to increase . Among the many nutritional benefits of oat are its high levels of calcium, β-glucan soluble fiber [3, 4, 5, 6], and high-quality oil and protein [7, 8]. Oat seed contains no gluten and only low levels of gluten-related prolamins and therefore is a healthy diet alternative for many people who cannot tolerate dietary gluten. Oat is high in polyphenolic avenanthramides having antioxidant, anti-inflammatory, and antiatherogenic properties . Oat also contains two classes of saponins: avenacosides (sugars bound to steroids) and avenacins (sugars bound to triterpenoid), both of which have been shown to lower cholesterol, stimulate the immune system, and have anti-carcinogenic properties . Oat also has many topical uses, as it has a soothing effect on skin and has been used to treat dry, itchy skin ; oat has also been shown to have sun-blocking properties , and it is often found in products to treat eczema, psoriasis, and other skin conditions [12, 13].
Common oat (A. sativa) and red oat (A. byzantina C. Koch) are allohexaploids (2n = 6x = 42, AACCDD subgenomes) belonging to the Poeae tribe of the Poaceae and are thought to have been domesticated from wild-weedy A. sterilis L., a species that arose from hybridization between a CCDD allotetraploid closely related to modern A. insularis Ladiz. and an AsAs diploid. Several variants of the A-subgenome diploids exist (Ac, Ad, Al, Ap, and As) and are known to harbor several genetic features of significance, including major crown rust resistance genes that have been transferred into hexaploid oat cultivars [18, 19]. The A-genome diploids have also been identified as potential gene sources for improving soluble fiber and protein. The A-subgenome is also part of a major intergenomic translocation (7C-17A) in A. sativa-A. sterilis that has been associated with adaptation to winter hardiness, a key element in oat production that likely contributed to the plant’s ability to shift from Mediterranean winter ecology to Eurasian spring-summer cultivation.
The C-subgenome chromosomes have a high amount of diffuse heterochromatin along their entirety ; this is a genetic feature not seen in the A and D chromosomes, where heterochromatin is localized and seemingly concentrated around the centromeres, at the telomeres, and in flanking secondary constrictions where rRNA genes are located. Among the important genetic features in the C-subgenome is a terminal translocation segment on the long arm of 21D which carries a putative CSlF6c locus that likely has a negative effect on seed soluble fiber content [23, 24]. Linkage has also been demonstrated between the chromosome 5C telomeric knob in allotetraploid A. magna Murphy et Terrell (CCDD subgenomes) and co-segregating genes controlling awn production and basal abscission layer formation which have been implicated in the domestication of common oat .
Despite the historical importance of oat and the renewed interest in its nutritional value, a complete genome sequence of oat has yet to be reported. Indeed, the A. sativa genome is large (> 12 Gb), duplicated, complex, highly repetitive, and characterized by several major intra- and intergenomic rearrangements, all of which make full genome assembly of the hexaploid difficult. Here we report the development of fully annotated, chromosome-scale assemblies for the extant progenitor species of the As- and Cp-subgenomes, A. atlantica B.R.Baum & Fedak and Avena eriantha Durieu, respectively. Using these assemblies, we (i) identified and quantified repetitive element content in the genome, including centromeric and telomeric repeats, (ii) analyzed syntenic relationships with other cereal grains and homoeologous relationships within oat using consensus linkage maps, (iii) identified putative candidate genes for flowering time and crown rust resistance relative to recently published genome-wide association studies (GWAS), (iv) estimated the age of the evolutionary split between the A- and C-subgenomes using synonymous substitution rate (Ks) analysis, and (v) examined the genetic diversity and phylogenetic relationships within a resequencing panel of 76 accessions of A- and C-genome Avena species.
Whole-genome sequencing and assembly
We selected the A. atlantica accession Cc 7277 and the A. eriantha accession CN 19328 for whole-genome shotgun sequencing. Both accessions are highly inbred and phenotypically homogeneous and represent type accessions for their respective species. A total of 31,544,396 and 28,257,346 PacBio reads were generated across 122 (RSII and Sequel) and 54 (Sequel) SMRT cells, yielding a total of 325.9 Gb (~84× coverage) and 276.6 Gb (~71× coverage) of sequence data for A. atlantica and A. eriantha, respectively. The longest reads for each species, 194,884 and 151,576 bp, came from the Sequel instrument. The N50 read length for A. atlantica and A. eriantha was 18,658 and 15,102 bp, respectively. In addition to PacBio sequencing, a total of 192 Gb for A. atlantica and 40 Gb for A. eriantha of 2 × 150 bp Illumina sequences were generated. A k-mer analysis (k = 21) using GenomeScope predicted a genome size of 3.72 Gb with 0.07% heterozygosity and a repeat fraction of 78% for A. atlantica and a genome size of 4.17 Gb with 0.12% heterozygosity and a repeat fraction of 76% for A. eriantha. The relative magnitude of these values agrees well with those reported by Bennett and Smith and Yan et al., both of which report that the genomes of the A-genome diploids are ~ 15% smaller than those of the C-genome diploids. However, former estimates determined by replicated flow cytometry measurements ranged in size from 4.1 to 4.6 Gb for A-genomes and from 5.0 to 5.1 Gb for C-genomes. The differences in genome size predicted by k-mer vs. flow cytometry analyses are likely a reflection of the significant repeat fraction in the oat genome, which is difficult to account for using a k-mer analysis.
Table 1 Summary statistics for the Canu and Hi-C assemblies of A. atlantica and A. eriantha
Number of scaffolds
Total size of scaffolds (bp)
Longest scaffold (bp)
Shortest scaffold (bp)
Number of scaffolds > 1 M nucleotides
N50 scaffold length
L50 scaffold count
Scaffold % A
Scaffold % C
Scaffold % G
Scaffold % T
Scaffold % N
Scaffold N nt
Scaffold % non-ACGTN
Percentage of assembly in scaffolded contigs
Average number of contigs per scaffold
Average length of breaks (20 or more Ns) between contigs
Number of contigs
Number of contigs in scaffolds
Number of contigs not in scaffolds
Total size of contigs
Number of contigs > 1 M nt
N50 contig length
L50 contig count
Contig % A
Contig % C
Contig % G
Contig % T
Contig % N
To improve the Canu assemblies, contigs were further scaffolded by chromatin-contact maps using DoveTail Chicago® and Hi-C libraries. Chicago® library contact maps are based on purified DNA that is reconstituted in vitro and are thus limited to chromatin associations no larger than the size of the purified input DNA fragments (< 100 kb). Nonetheless, they are ideal for detecting and correcting misjoins in de novo assemblies as well as for short-range scaffolding. Approximately 73× coverage of 1–100 kb read pairs (2 × 150) was generated from Chicago® libraries for each Avena species and used to scaffold the Canu assemblies using the HiRiSE™ scaffolder. In total, 334 and 158 breaks were made, while 1157 and 2962 joins were made in the A. atlantica and A. eriantha assemblies, respectively. The net effect of these changes was to decrease the number of total scaffolds to 3118 in the A. atlantica assembly and to 5263 in the A. eriantha assembly, which was accompanied by a slight decrease in N50 (4.310 and 1.314 Mb, respectively) for each assembly. Whenever a join was made between contigs, an “N” gap, consisting of 100 Ns, was created. The total percent of the genome in gaps, for both species, was less than 0.1%.
The Chicago®-based assemblies were then further scaffolded using in vivo Hi-C libraries, created from native chromatin to produce ultra-long-range mate pairs. Mate pair reads (10–10,000 kb) representing a physical coverage of 2749× and 513× were generated for the A. atlantica and A. eriantha genomes and scaffolded using the HiRiSE™ scaffolder. In total, 922 joins and 2614 joins (plus three breaks) in the A. atlantica and A. eriantha were made, respectively, producing ultra-long scaffolds, putatively representing full-length chromosomes and/or chromosome arms. The HiRiSE™ assembly for A. atlantica had a scaffold N50 of 513.2 Mb, and an L50 of 4, spanning a total sequence length of 3.685 Gb. The longest scaffold spanned 577.8 Mb. Similarly, the A. eriantha assembly had a scaffold N50 of 534.8 Mb, an L50 of 4, and spanned a total sequence length of 3.778 Gb with the longest scaffold reaching 588.3 Mb. Scaffold joins produced by the Hi-C mate pairs introduced new “N” gaps in the assembly (each consisting of 1000 Ns), thereby increasing the number of gaps in the assembly to 2079 and 5576 for A. atlantica and A. eriantha, respectively. The final percentage of “N” nucleotides in the final assemblies was less than 0.1%, with the average gap size of 600 and 578 bp, respectively (Table 1).
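The N50 and L50 statistics quoted above can be illustrated with a minimal Python sketch. It assumes the usual definitions (N50 is the length of the scaffold at which the cumulative length of scaffolds, sorted longest first, reaches half the assembly size; L50 is the number of scaffolds needed to reach that point). The function name and the toy lengths are hypothetical and are not the actual scaffold lengths of either assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold or contig lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of the assembly size
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is N50; 'count' scaffolds were needed, which is L50
            return length, count


# Hypothetical toy lengths in Mb: seven chromosome-sized scaffolds plus
# two small unplaced ones. These are NOT values from the assemblies.
toy_scaffolds = [580, 560, 540, 520, 500, 480, 450, 20, 10]
n50, l50 = n50_l50(toy_scaffolds)
print(f"N50 = {n50} Mb, L50 = {l50}")
```

With seven chromosome-sized scaffolds dominating the assembly, an L50 of 4 is what one would expect, which is consistent with the values reported for both species.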
The longest eight scaffolds of the A. atlantica assembly, presumably representing two chromosome arms (205 and 278 Mb) and six full-length chromosomes (448–577 Mb), accounted for > 96% of the total sequence length from the Canu assembly. Similarly, the longest seven scaffolds of the A. eriantha assembly, ranging in size from 455 to 588 Mb and presumably representing each of the seven haploid A. eriantha chromosomes, accounted for > 97% of the total Canu assembly sequence. For simplicity, scaffolds representing each of the seven chromosomes from each species are referred to hereafter by size (longest to shortest) as AA1-AA7 and AE1-AE7. The scaffolds in the A. atlantica and A. eriantha assemblies that remain unintegrated into one of the chromosome-scale pseudomolecules are relatively small and repetitive, with an average size of 61 and 38 kb, respectively. These properties likely contributed to the inability of the proximity-guided assembler to confidently place these contigs within the framework of the chromosomes, specifically because of the low number of interactions on a short fragment and the inability to discern interaction distance differences over a short molecule.
Chromosome arm merging
Physical map and linkage map assignment. Haplotag markers from the consensus map of Latta et al. were used to assign scaffold assemblies to linkage groups. Two scaffolds mapped to LG 2 and were merged.
A. atlantica Hi-C scaffold
A. atlantica chromosome
Analysis of repetitive elements
The repeat fraction of the Avena genome assemblies was identified and annotated using RepeatModeler and RepeatMasker. In total, ~ 83% of each genome was classified as repetitive, with the most commonly identified repetitive elements being classified as long terminal repeat retrotransposons (LTR-RTs); LTR-RTs are the most abundant genomic components in flowering plants [40, 41], and their abundance is strongly correlated with genome size . Within published plant genomes, repeat content varies widely, ranging from 3% for the minute 82 Mb genome of Utricularia gibba L.  to 85% for maize . Given the large size of these genomes, it is not surprising that < 20% of the genome is classified as non-repetitive.
In addition to the interspersed repeat elements, ~ 0.5% of the genome was classified as low complexity, satellite, microsatellite or telomeric repeat (see genomic feature section below). Indeed, 5217 and 3404 putative microsatellite loci were identified, with the most common di-, tri- and tetranucleotide repeat motif identified being (AT)n, (AAC)n or (GGC)n and (TTTA)n, in A. atlantica and A. eriantha, respectively. To date, no microsatellites have been generated specifically for the Avena diploid species – thus these new putative microsatellite loci represent important genetic tools for studying diversity and specifically for advancing breeding in the A-genome diploids.
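As an illustration of how putative di-, tri-, and tetranucleotide microsatellite loci of the (AT)n or (TTTA)n type can be located in a sequence, the following Python sketch scans for perfect tandem repeats with a regular expression. It is not the RepeatMasker procedure used here; the function name, the minimum of five repeat units, and the toy sequence are assumptions made for the example.

```python
import re


def find_microsatellites(seq, min_repeats=5):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    Only motifs of 2-4 nt are considered (an assumption of this sketch),
    and homopolymer runs such as AAAA... are ignored.
    """
    # A lazily matched 2-4 nt motif followed by at least
    # (min_repeats - 1) further copies of the same motif.
    pattern = re.compile(r"([ACGT]{2,4}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:      # skip homopolymers
            continue
        repeat_count = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeat_count))
    return hits


# Hypothetical toy sequence containing (AT)6 and (TTTA)5.
toy = "GG" + "AT" * 6 + "CCG" + "TTTA" * 5 + "G"
for start, motif, count in find_microsatellites(toy):
    print(f"position {start}: ({motif}){count}")
```

The lazy quantifier makes the shortest motif win, so a run such as ATATATAT is reported as (AT)n and not as (ATAT)n.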
Transcriptome assembly and functional annotation
The A. atlantica and A. eriantha transcriptomes, which consisted of 51,223 and 47,361 scaffolded isoforms, the Brachypodium cDNA and peptide models (v 1.0; Ensembl genomes) and the uniprot-sprot database were provided as primary evidence for annotation in the MAKER pipeline . The RNA-Seq data mapped with high efficiency to the assemblies, with > 96% of the reads mapping to their respective genome at 93.1% concordance for pair alignment rates, suggestive of high-quality genome assemblies for both species. The MAKER pipeline identified a total of 51,100 and 49,105 gene predictions, with mean transcript lengths of 3018 and 3153 bp, and with 70% and 66% of the annotations having annotation edit distance (AED) measures < 0.25, for A. atlantica and A. eriantha genomes, respectively. AED integrates sensitivity, specificity, and accuracy measurement to calculate annotation quality, where AED values < 0.25 are indicative of high-quality annotations . The mean G+C content of the transcripts in both species was ~ 52%. The increase in G+C content within coding regions relative to the overall G+C content of the genome (~ 44%) is a well-known phenomenon and is hypothesized to be the result of GC-biased gene conversion – a process by which the G+C content of DNA increases due to gene conversion during recombination .
The completeness of the gene space was quantified using BUSCO which provides a quantitative measure for genome and transcriptome completeness based on a core set of highly conserved plant-specific single-copy orthologs . Of the 1440 plant-specific orthologs, 1387 (96.3%) were identified in the A. atlantica genome assembly as full length, while 1395 (96.9%) were identified in the A. eriantha assembly as full length, suggesting high-quality and complete genome assemblies. As expected for diploid species, the level of gene duplication, as identified by BUSCO for the conserved orthologous genes was low for both species (2.2% and 2.3%). Similarly, a BUSCO analysis of the transcript and protein annotation sets produced by MAKER identified similar numbers of conserved orthologs for both species, which is indicative of a successful annotation process.
Pericentromeric regions, associated with reduced recombination relative to physical distance, were evident from the linkage and physical map comparison in A. atlantica (Fig. 1). The observation that gene density is substantially reduced provided further evidence that these regions represented centromeric regions in both species, as has been well-documented previously in other eukaryotes [60, 61]. Centromeres in most plant species are complex but are dominated by megabase-sized arrays of tandemly arranged monomeric satellite repeats. While complex and highly diverse among plant species, they commonly share a unit length ranging between 150 and 180 bp, which is close to the size of the nucleosome unit. Melters et al. showed that due to the relative size of the centromere, the most common repeat found in whole-genome sequencing data is the putative centromeric repeat. Using the output of RepeatModeler from A. eriantha, we identified a high-copy-number 159 bp tandem repeat that aligned specifically with the putative centromere location in each of the A. atlantica and A. eriantha chromosomes (Fig. 2; Additional file 2). Although the 159 bp repeat is similar in size to the putative centromeric repeats found in other grass species (e.g., B. distachyon, 156 bp; H. vulgare, 139 bp; Oryza brachyantha A.Chev. & Roehr., 154 bp; Z. mays, 156 bp), not surprisingly it shares little sequence similarity with the centromeric repeats of those species. Indeed, centromeric repeats exhibit little to no evidence of sequence similarity beyond ~ 50 million years of divergence. As has been documented in other plant species, these putative centromeric repeats span a large portion (often > 50 Mb) of the A. atlantica and A. eriantha chromosomes, suggesting the presence of large pericentromeric heterochromatic regions [64, 65].
Moreover, the positioning of the centromeres, as defined by this putative repeat and the gene density plots, is consistent with the cytological positioning of the centromere, which suggests that the centromeres in A. atlantica are almost all metacentric to submetacentric, while the centromeres in A. eriantha are almost all sub-telocentric [66, 67]. Indeed, per our analyses, we identified three metacentric, two submetacentric and two sub-telocentric chromosomes in A. atlantica, and five sub-telocentric, one submetacentric and one metacentric chromosome in A. eriantha (Fig. 2).
RepeatModeler annotated a putative telomeric satellite sequence (665 bp) for A. eriantha (Additional file 1: Table S1). A homologous sequence (639 bp) with significant homology (E-value = 0.0) and alignment identity (Identity = 80%; Gap = 6%) was identified among the A. atlantica repeat sequences reported by RepeatModeler (Additional file 2). BLAST searches of the assemblies with their respective satellite telomeric repeat sequence identified enriched regions of the telomeric repeat on all seven of the chromosomes for each species (Fig. 2). In A. atlantica, telomeric satellite sequences were located toward the distal end of each chromosome; however, in A. eriantha the location of the sequence is more dispersed, being found primarily at the end of the chromosomes on ten of the 14 chromosome arms, while in a few instances being interspersed interstitially. Interstitial telomere-like repeats have been reported in several plant species including Anthurium, Vicia, Sideritis, Typhonium, and Pinus, where they were implicated in chromosome rearrangements, including inversions, translocations, and chromosome fusions [68, 69, 70, 71]. While chromosomal rearrangements are common in Avena, we caution that the very repetitive nature of telomeric sequences makes them susceptible to collapse during the assembly process and thus inherently difficult to order and orient in the Hi-C scaffolding process.
SNP discovery and genetic diversity
To characterize the diversity and phylogenetic relationships among Avena A- and C-genome diploid species, we resequenced at 10X coverage 61 A-genome diploid accessions (including A. atlantica, A. brevis, A. canariensis, A. damascena, A. hirtula, A. longiglumis, A. lusitanica, A. strigosa, A. strigosa-brevis, A. strigosa-hispanica, A. strigosa-nuda, and A. wiestii) and 10 C-genome diploids (A. clauda, A. eriantha, A. ventricosa; Additional file 3: Table S2). The resequencing produced 40 Gb sequence data per accession (Additional file 3: Table S2). The A-genome accession reads were mapped against the A. atlantica genome, while the C-genome species were mapped against the A. eriantha genome for SNP discovery using InterSnp . InterSnp uses BAM files to call SNPs between samples based on consensus alleles at each genomic position, filtered to produce a dataset with 0% missing data across all lines. Considering the cleistogamous nature of the accessions included, any SNPs with > 5% heterozygous calls were deemed likely to result from spurious read-mapping and were removed from the dataset. Using a minimum allele frequency threshold of < 0.1, a total of 286,567 and 3,185,959 putative SNPs were identified within the A-genome and C-genome diploid datasets, respectively, and used by SNPhylo  to investigate phylogenetic relationships. SNPhylo reduces oversampling effects at linked SNPs using an LD threshold (0.1) with a sliding window of 500,000 base pairs. Thus, a total of 7221 and 11,530 SNPs, with an average of 1032 and 1647 SNPs per chromosome, were selected prior to tree construction for the A-genome and C-genome diploids dataset, respectively (Additional file 4: Table S3).
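The three filters described above (no missing data, removal of SNPs with > 5% heterozygous calls, and an allele frequency threshold of 0.1) can be sketched in Python as follows. This is not the InterSnp implementation. The genotype coding (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate, None = missing), the function name, and the reading of the 0.1 threshold as "keep SNPs whose minor allele frequency is at least 0.1" are all assumptions made for the example.

```python
def keep_snp(calls, max_het=0.05, min_maf=0.1):
    """Decide whether one SNP passes the filters.

    calls: one genotype per accession, coded 0, 1, 2 or None (missing).
    This coding is an assumption of the sketch, not a tool format.
    """
    if not calls or any(c is None for c in calls):
        return False                      # require 0% missing data
    n = len(calls)
    het_fraction = sum(1 for c in calls if c == 1) / n
    if het_fraction > max_het:
        return False                      # likely spurious read mapping
    alt_freq = sum(calls) / (2 * n)       # alternate allele frequency
    maf = min(alt_freq, 1 - alt_freq)     # minor allele frequency
    return maf >= min_maf


# Hypothetical toy genotypes for ten accessions at three SNPs.
toy_snps = {
    "snp_a": [0] * 8 + [2] * 2,       # informative, no heterozygotes
    "snp_b": [0] * 9 + [1],           # 10% heterozygous calls
    "snp_c": [0, 2, None] + [0] * 7,  # one missing call
}
kept = [name for name, calls in toy_snps.items() if keep_snp(calls)]
print(kept)
```

In a highly selfing panel, true heterozygous sites are expected to be rare, which is why the heterozygosity filter is applied before allele frequencies are considered.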
The bootstrapped maximum likelihood phylogenetic trees were rooted with either the A. eriantha accession CN 19328 for the A-genome accessions tree (Fig. 5a) or with the A. atlantica accession Cc 7277 for the C-genome accessions tree (Fig. 5b). The A-genome diploids formed two distinct clades: one of these consisted primarily of accessions classified in taxa having the AsAs subgenome, which had previously been described by Rajhathy and Morrison  and Leggett  as including A. atlantica, A. hirtula, the domesticated forms of A. strigosa, and A. wiestii; and a second clade comprised mostly of A. canariensis (AcAc), Syrian accessions of A. damascena (AdAd), A. longiglumis (AlAl), and three floret-shattering accessions that were possibly misidentified as A. hirtula and A. lusitanica. As expected, the spikelet-shattering A. atlantica occupied the basal position on the AsAs branch of the tree, while all of the A. strigosa (domesticated AsAs) genotypes formed a clade at the top of the tree and included a single accession of weedy A. wiestii (CIav 1994) that, upon inspection of the panicles, more closely resembled a long-awned, semi-shattering A. strigosa genotype.
The A. strigosa branch shows clearly the effect of a domestication bottleneck. This branch of the tree is subdivided into two distinct sub-branches. The upper sub-branch consists predominantly of genotypes of Iberian origin (i.e., CN 25698, CIav 9019, CIav 9036, etc.) and includes seven homogeneous accessions that are derivatives of the Brazilian ‘Saia’ variety of forage oat (i.e., CIav 7010, PI 291990, etc.). Interestingly, the A. hispanica genotypes form a unitary subclade within this branch that is strongly supported by the bootstrap value. The lower sub-branch is comprised of accessions from outside the Iberian Peninsula (PI 83721, PI 287314, PI 304557, CIav 9022, etc.) and includes all of the A. strigosa-nuda varieties. Since A. brevis strains are distributed in both branches, there is no evidence to confirm its identity as a distinct taxon within or apart from A. strigosa. The presence of branches containing multiple, genetically indistinct accessions indicates there is a high degree of duplication being curated within the USDA and PGR-Canada gene banks for A. strigosa.
The remainder of the A-genome tree consists of entirely wild genotypes. Avena lusitanica is not a universally accepted taxon, and the presence of these strains on various branches of the tree confirms that this is not a valid independent taxonomic entity; instead, it is part of the floret-dispersing A. hirtula-wiestii complex of semi-desert and Mediterranean scrub ecotypes of the AsAs biological species complex. The presence of three floret-shattering accessions from Morocco that were previously identified as A. damascena (AdAd) on the AsAs branch (PI 657468, PI 657471, PI 657472) and the two Syrian A. damascena genotypes on the other branch (CN 19457, CN 19459) suggests that the Moroccan group are misidentified and are therefore, like A. lusitanica, either members of the AsAs A. hirtula-wiestii-atlantica-strigosa complex or, possibly, misclassified accessions of tetraploid A. barbata.
The rooted C-genome tree had the lone A. ventricosa (CvCv) accession at the base of the C-genome branch that consisted of two subclades. The more basal branch consisted of accessions of spikelet-shattering A. eriantha (CpCp) from Algeria (CN 24022) and four samples of A. eriantha from an extended population growing in the Middle Atlas Mountains of Morocco (PI 657575–8). The other branch included Algerian (CN 24040) and Turkish (CN 19238) accessions of floret-shattering A. clauda (CpCp) along with Iranian (CN 19256) and Algerian (CN 19328, the reference genome) A. eriantha genotypes.
The age of the Poaceae family has been difficult to establish, with varying ages reported in the literature [76, 77]. Schubert et al. recently reported the use of newly available paleobotanical fossils to establish the age of the family to be approximately 120.8 million years ago (Ma), with the split of the Aveneae, Brachypodieae, and BOP clades occurring approximately 44.3, 51.8, and 80.2 Ma, respectively, suggesting that the grasses have a lower nucleotide substitution rate than the other angiosperms. We calculated the rate of synonymous nucleotide substitutions per synonymous site (Ks) for orthologous gene pairs between the A. atlantica and A. eriantha assemblies using the CodeML tool on the CoGe platform (genomevolution.org/coge). A total of 18,002 duplicate gene pairs were identified with a clear peak seen at Ks = 0.0875 (Additional file 5: Figure S1). From the node estimates reported by Schubert et al., we calculated an average substitution rate for the Pooideae lineage of 3.39E-09, suggesting that speciation between A. atlantica and A. eriantha occurred between 5.4 and 12.9 Ma, depending on whether a core eukaryotic-based synonymous mutation rate or the calculated lineage-specific rate for Pooideae was used in the calculation [81, 82]. As seen in the SynMap dotplot alignment (Additional file 6: Figure S2), significant synteny was observed between the Avena chromosomes, consisting of 187 syntenic blocks with 21,021 collinear gene pairs (112 genes/block) with 98.2% coverage across both the A. atlantica and A. eriantha genomes. As expected, given the relatively close ancestry of the species, the size (bp) of the syntenic blocks between species was highly correlated (R² = 0.88; Additional file 6: Figure S2C). The large blocks of syntenic genes are suggestive of orthologous relationships between the chromosomes of the species (Additional file 6: Figure S2A).
For example, slightly more than 77% (349 Mb) of the syntenic sequence found on AA2 is derived from AE5, suggesting that they are orthologs. Indeed, using a majority rule (> 50% syntenic sequence) we identified the following orthologous chromosome pairs: AA1 + AE6 (61%; 248 Mb); AA2 + AE5 (77%; 349 Mb); AA3 + AE3 (74%; 318 Mb); AA4 + AE1 (71%; 271 Mb); AA7 + AE2 (57%; 274 Mb); with AA5 and AA6 sharing orthology with several A. eriantha chromosomes (Additional file 6: Figure S2B).
The Poaceae family consists of many agronomically important species, commonly referred to as cereals, that are found in three subfamilies: Oryzoideae (rice), Panicoideae (maize, sorghum) and Pooideae (wheat, barley, oat and rye). Pooideae forms 14 tribes, including the tribes Brachypodieae, Poeae (syn Aveneae, including oat) and Triticeae (barley, rye, and wheat), with Poeae and Triticeae tribes having separated ~ 49 Ma . This agrees well with the Ks analyses presented here for A. atlantica and A. eriantha and with the published Hordeum vulgare genome , which both show a clear peak at 0.3 – suggestive of a divergence time of 44 Ma (per the calculated lineage specific rate for Pooideae). As expected, the Ks analyses from the Avena comparisons with the B. distachyon genome (International Brachypodium Initiative, 2010) suggested a more distant divergence of 47–51 Ma for the split of the Avena–Brachypodium lineages (Additional file 5: Figure S1).
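The divergence times quoted from the Ks peaks can be reproduced with a short Python sketch. It assumes the standard relationship T = Ks / (2r), where r is the synonymous substitution rate per site per year and the factor of 2 accounts for substitutions accumulating along both lineages; this form is an assumption of the sketch, although it recovers the values reported in the text when r is the Pooideae lineage rate of 3.39E-09. The function and constant names are illustrative.

```python
POOIDEAE_RATE = 3.39e-9  # substitutions per synonymous site per year (from the text)


def divergence_time_ma(ks, rate_per_site_per_year):
    """Divergence time in millions of years, assuming T = Ks / (2 * r)."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    years = ks / (2 * rate_per_site_per_year)
    return years / 1e6


# Ks peak between A. atlantica and A. eriantha (from the text).
print(round(divergence_time_ma(0.0875, POOIDEAE_RATE), 1))
# Ks peak for the Avena vs. Hordeum vulgare comparisons (from the text).
print(round(divergence_time_ma(0.3, POOIDEAE_RATE), 1))
```

The lower bound of 5.4 Ma reported for the A. atlantica and A. eriantha split comes from applying the same formula with the faster core eukaryotic rate, whose value is not stated in this section.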
SynMap was also used to investigate syntenic relationships between the Avena and Hordeum chromosomes (Additional file 7: Figure S3 and Additional file 8: Figure S4). Although more syntenic blocks (719 and 714) were identified in the Avena–Hordeum comparisons, they were smaller, consisting of ~ 8.5 genes/block, and were accompanied by a lower syntenic block size correlation (R² = 0.35 and 0.41; Additional file 7: Figure S3C and Additional file 8: Figure S4C). The decrease in block size and correlation is reflective of the more distant evolutionary relationship between the species. Nonetheless, the shared ancestry between the two Pooideae species was evident from the substantial synteny observed across all seven Avena–Hordeum chromosome comparisons (Additional file 7: Figure S3A and Additional file 8: Figure S4A). As expected, large, proximal, non-syntenic regions were observed in regions corresponding to putative centromeres where gene density is substantially reduced [60, 61]. The synteny observed among the Avena and Hordeum chromosomes suggests several orthologous relationships. For example, Hordeum chromosome 1H is clearly orthologous with Avena chromosomes AA2 and AE5. Indeed, of the syntenic sequence on 1H, 99% (116 Mb) was syntenic to AA2 and 85% (105 Mb) was syntenic to AE5, which is not surprising since AA2 and AE5 were shown above to be orthologs. Using a simple majority rule (> 50% syntenic sequence) the following are putative Hordeum–Avena orthologs: 1H + AA2/AE5; 2H + AA5/AE4; 3H + AA3/AE3; 6H + AA7/AE2; and 7H + AA1/AE6. The specific A. atlantica orthologs of 4H and 5H are likely AA6 and AA4, respectively; however, rearrangements obscure the likely orthologs for A. eriantha (Additional file 7: Figure S3B and Additional file 8: Figure S4B).
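The majority rule used for the ortholog assignments (> 50% of a chromosome's syntenic sequence mapping to a single chromosome of the other species) can be expressed as a small Python sketch. The function name is illustrative. In the toy input, only the 349 Mb shared between AA2 and AE5 comes from the text; the way the remaining syntenic sequence is split among other chromosomes is hypothetical and chosen only so that the AE5 share is about 77%.

```python
def majority_ortholog(syntenic_mb, threshold=0.5):
    """Return (chromosome, share) if one target exceeds the threshold, else None.

    syntenic_mb maps each target chromosome to the amount of syntenic
    sequence (Mb) it shares with the query chromosome.
    """
    total = sum(syntenic_mb.values())
    if total <= 0:
        return None
    best = max(syntenic_mb, key=syntenic_mb.get)
    share = syntenic_mb[best] / total
    return (best, share) if share > threshold else None


# AA2: 349 Mb with AE5 (from the text); the other two values are hypothetical.
aa2 = {"AE5": 349, "AE1": 60, "AE3": 44}
print(majority_ortholog(aa2))

# A hypothetical chromosome whose synteny is spread over several targets,
# as described for AA5 and AA6, yields no majority assignment.
scattered = {"AE1": 100, "AE4": 90, "AE7": 80}
print(majority_ortholog(scattered))
```

A strict majority guarantees at most one assignment per query chromosome, at the cost of leaving rearranged chromosomes unassigned.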
Ancestral subgenome group (A-, C- and D-) designations for each of the 21 consensus linkage groups reported for A. sativa. Haplotag markers mapping to (A) A. atlantica and (B) A. eriantha chromosomes, where the highest haplotag mapping counts are colored red and transition to white as the number of haplotags mapping decreases
Utility of the genome assemblies
Reference-quality, de novo whole-genome sequence assemblies for two highly repetitive ~ 4 Gb Avena diploid species were produced using a hybrid approach involving PacBio long reads, Illumina short reads, and both in vitro and in vivo chromatin-contact mapping. The whole-genome reference assemblies for As- and Cp-genome oat diploids provided for the first time in this paper represent powerful tools for identifying genes that underlie adaptive, disease resistance, and grain-quality traits critical for oat improvement. The utility of these whole-genome references was demonstrated first by analyzing sequences homologous to heading-date QTL-containing regions that were previously identified via GWAS in common hexaploid oat (A. sativa) to find linked candidate genes in A. atlantica and A. eriantha. Additionally, we used these references in successfully identifying RGAs homologous to oat crown rust resistance genes.
Avena atlantica retains a remarkable degree of synteny in comparison with barley, while A. eriantha has undergone a relatively greater degree of chromosomal rearrangement, suggesting the presence of an underlying genomic instability in the C-genome diploids. This might be related to the abundant heterochromatin, including the underlying pAm1 repeat motif, distributed throughout the chromosomes of this genome (Fig. 2b, Track 5). The genome sequences of these diploids provide considerable insight into the complex evolutionary processes that have led to the appearance of cultivated diploid, tetraploid, and hexaploid oat going back millions of years. These processes included responses to natural selective events such as the Zanclean Cataclysm ~ 5 Ma and repeated cycles of global climate change characterized by boreal glacial maxima interspersed with humid periods and desertification due to northerly expansions of the Saharan and Arabian Deserts [99, 100, 101].
The oat community has struggled without a reference genome for decades. Finally, we have complete references for what are, essentially, all three component genomes of cultivated hexaploid oat and the four known subgenomes of the genus, given the close correspondence between Avena subgenomes A, B, and D. Moreover, once a complete hexaploid reference is available, the utility of these component genomes will increase further, as they will provide a precise roadmap of the structural and functional evolutionary steps that took place in the formation of this unique and important polyploid species.
Plant material and nucleic acid extraction
For whole-genome assembly, young leaf tissue (~ 14–21 days post emergence), dark treated for 72 h, from A. atlantica (CC7277; T. Langdon, Aberystwyth University, Wales, UK) and A. eriantha (BYU132; EN Jellen, Brigham Young University, Provo, UT) was flash-frozen and sent to the Arizona Genomics Institute (AGI; Tucson, AZ, USA) for high molecular weight DNA extraction. For the diversity panel, DNA from 76 accessions of diploid A- and C-subgenome species (Additional file 3: Table S2) was extracted from 30 mg of freeze-dried leaf tissue using a protocol devised by Sambrook et al. [103] with modifications described by Todd and Vodkin. All plants were grown in greenhouses at Brigham Young University (BYU) using Sunshine Mix II (Sun Gro, Bellevue, WA, USA) supplemented with Osmocote fertilizers (Scotts, Marysville, OH, USA) and maintained at 25 °C under broad-spectrum halogen lamps, with 12-h photoperiods.
DNA sequencing and read processing
For whole-genome sequencing, large-insert SMRTBell libraries (> 20 kb), selected using a BluePippin System (Sage Science, Inc., Beverly, MA, USA), were prepared according to standard manufacturer protocols. Libraries were sequenced using P6-C4 chemistry on either the RS II or Sequel sequencing instruments (Pacific BioSciences, Menlo Park, CA, USA; Additional file 13: Table S6). Sequencing was performed for A. atlantica at the DNA Sequencing Center (DNASC) at BYU (Provo, UT, USA) and at RTL Genomics (Lubbock, TX), while the sequencing for A. eriantha was performed at the BYU DNASC. For the diversity panel and for whole-genome polishing, extracted DNA was sent to the Beijing Genomic Institute (BGI; Hong Kong, China) for 2 × 150 bp paired-end (PE) sequencing from standard 500-bp insert libraries. Trimmomatic v0.35 was used to remove adapter sequences, to trim leading and trailing bases with a quality score below 20, and to trim reads where the average per-base quality fell below 20 within a four-nucleotide sliding window. After trimming, any reads shorter than 75 nucleotides in length were removed.
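The quality-trimming criteria described above can be illustrated with a simplified Python sketch. It is not Trimmomatic's implementation: the function name, the order in which the steps are applied, and the rule of cutting at the start of the first failing window are assumptions made for illustration. Only the thresholds (quality 20, four-nucleotide window, 75-nucleotide minimum length) come from the text.

```python
from typing import List, Optional, Tuple


def trim_read(
    quals: List[int], min_q: int = 20, window: int = 4, min_len: int = 75
) -> Optional[Tuple[int, int]]:
    """Return the (start, end) slice of a read to keep, or None to discard it.

    quals holds the per-base Phred quality scores of one read.
    """
    start, end = 0, len(quals)
    # Trim leading bases below the quality threshold.
    while start < end and quals[start] < min_q:
        start += 1
    # Trim trailing bases below the quality threshold.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    # Scan 5' to 3'; cut where a window's mean quality first drops below min_q.
    i = start
    while i + window <= end:
        if sum(quals[i:i + window]) / window < min_q:
            end = i
            break
        i += 1
    # Discard reads that are too short after trimming.
    if end - start < min_len:
        return None
    return (start, end)


# Hypothetical read: 3 poor leading bases, 80 good bases, 10 poor trailing bases.
kept = trim_read([10] * 3 + [30] * 80 + [5] * 10)
```

A read with a low-quality stretch near its middle is cut at that point and is then likely to fall below the 75-nucleotide minimum, so it is removed entirely.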
RNA-Seq and transcriptome assembly
For A. atlantica, RNA-Seq data consisted of 2 × 100 bp PE Illumina reads derived from 11 different plant tissue types, including stem, mature leaf, stressed mature leaf, seed (2 days old), hypocotyl (4–5 days old), root (4–5 days old), vegetative meristem, green grain, yellow grain, young flower (meiotic), and green anthers (Additional file 14: Table S7). For A. eriantha, 2 × 150 bp PE RNA-Seq data were generated by BGI from six tissue sources, including young leaf, mature leaf, crown, roots, immature panicle, and whole seedling, harvested from plants grown hydroponically in 1× Maxigro™ (GH Inc., Sebastopol, CA, USA) in growth chambers maintained at 21 °C under broad-spectrum halogen lamps, with a 12-h photoperiod at BYU (Additional file 14: Table S7). The resulting reads were trimmed using Trimmomatic, then aligned to either the A. atlantica or A. eriantha reference using HiSat2 v2.0.4 with default parameters and the maximum intron length set to 50,000 bp. Following alignment, the resulting SAM file was sorted and indexed using SAMtools v1.6 and assembled into putative transcripts using StringTie v1.3.4. The completeness of the assembled transcriptome was assessed using BLAST comparisons to the reference Brachypodium distachyon L. peptide set (ftp://ftp.ensemblgenomes.org/pub/plants/release-37/fasta/brachypodium_distachyon/pep/).
Genome size, assembly, polishing, and scaffolding
Genome size was estimated using Jellyfish and GenomeScope v1.0 at k-mer length = 21 for each species. Initial assemblies were done using Canu v1.7 with default parameters (corMhapSensitivity = normal and corOutCoverage = 40). The resulting assemblies were polished using Arrow from the GenomicConsensus package in the Pacific BioSciences SMRT portal v5.1.0 and PILON v0.22 using Illumina short reads. Chicago® and Hi-C proximity-guided assemblies were performed by Dovetail Genomics LLC (Santa Cruz, CA, USA) to produce chromosome-scale assemblies. Fresh leaf tissue from a single dark-treated (72 h), 3-week-old plant, derived directly from selfing of the original A. atlantica and A. eriantha plants, was sent to Dovetail Genomics for Chicago® and Hi-C library preparation as described by Putnam et al. and Lieberman-Aiden et al., respectively, using the DpnII restriction endonuclease. The libraries were sequenced using a standard Illumina library prep followed by sequencing on an Illumina HiSeq X in rapid run mode. The HiRiSE™ scaffolder and the Chicago® and Hi-C library-based read pairs were used to produce a likelihood model for genomic distance between read pairs, which was used to break putative misjoins and to identify and make prospective joins in the de novo Canu assemblies.
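The principle behind k-mer-based genome size estimation can be shown with a deliberately crude Python sketch. GenomeScope fits a statistical model to the k-mer spectrum; the version below simply divides the total number of k-mer occurrences by the depth of the main peak. The function name, the error cutoff (min_depth), and the toy histogram are all hypothetical.

```python
from typing import Dict, Tuple


def estimate_genome_size(
    histogram: Dict[int, int], min_depth: int = 5
) -> Tuple[int, float]:
    """Estimate genome size from a k-mer depth histogram.

    histogram maps a k-mer depth to the number of distinct k-mers observed
    at that depth (the kind of table a k-mer counter reports). Depths below
    min_depth are treated as sequencing errors and ignored.
    Returns (peak_depth, estimated_genome_size_in_kmers).
    """
    usable = {d: c for d, c in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The most frequent depth approximates the sequencing coverage.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by coverage approximates genome length.
    total_occurrences = sum(d * c for d, c in usable.items())
    return peak_depth, total_occurrences / peak_depth


# Toy histogram: an error tail at depths 1-2, a main peak at 20, repeats at 40.
toy = {1: 1000, 2: 500, 10: 100, 20: 1000, 40: 50}
peak, size = estimate_genome_size(toy)
```

In highly repetitive genomes such as these, high-depth k-mers contribute a large share of the total, which is one reason a fitted model is preferable to this simple ratio.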
Repeat analysis, genome completeness, and annotation
RepeatModeler v1.0.11 and RepeatMasker v4.0.7 were used to quantify and classify repetitive elements in the final assemblies, relative to RepBase libraries (v20181026; www.girinst.org). Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2 was employed to assess the completeness of the assembly using the Embryophyta odb9 dataset and the “--long” argument, which applies Augustus optimization for self-training.
MAKER2 v2.31.10 [56, 57] was used to annotate the final genomes. Expressed sequence tag evidence for annotation included the de novo transcriptomes for each species as well as the cDNA models from Brachypodium distachyon L. (v 1.0; Ensembl genomes). Protein evidence included the uniprot-sprot database (downloaded September 25, 2018) as well as the peptide models from B. distachyon (v 1.0; Ensembl genomes). Repeats were masked based on species-specific files produced by RepeatModeler. For ab initio gene prediction, A. atlantica and A. eriantha species-specific AUGUSTUS gene prediction models were provided as well as rice (Oryza sativa)-based SNAP models.
Variant identification and tree creation
Single nucleotide polymorphisms (SNPs) for the diversity panel were identified from the Illumina reads by mapping the A-subgenome and C-subgenome diploid accessions against the A. atlantica and A. eriantha reference genome assemblies, respectively, using BWA-mem v0.7.17. Output SAM files were converted to BAM files and sorted using SAMtools v1.6, and indexed using Sambamba v0.6.8. InterSnp, an analysis tool from the BamBam v1.4 package, was used to call SNPs with the arguments -m 2 and -f 0.35. Bash scripting was used to remove SNPs with less than 100% genotype calls across all accessions (i.e., no missing data was allowed) and, given the cleistogamous nature of the species, SNPs for which 5% or more of the accessions were called as heterozygotes. SNPs on unscaffolded contigs were also removed prior to phylogenetic analysis. SNPhylo v20160204, which uses MUSCLE for sequence alignments and linkage disequilibrium to down-sample the SNP dataset, was used to build phylogenies with the bootstrapping parameter set to 1000. The resulting tree was visualized using FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/figtree).
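The two SNP filters described above (no missing genotype calls, and fewer than 5% heterozygous calls) can be expressed as a short Python sketch. The study used Bash scripting; the function name and the genotype encoding ("hom_ref", "hom_alt", "het", or None for a missing call) are hypothetical and chosen only for illustration.

```python
from typing import List, Optional


def keep_snp(calls: List[Optional[str]], max_het_fraction: float = 0.05) -> bool:
    """Decide whether a SNP passes the missing-data and heterozygosity filters.

    calls holds one genotype per accession: "hom_ref", "hom_alt", "het",
    or None when the accession has no genotype call at this site.
    """
    if not calls:
        return False
    # Require a genotype call in every accession (no missing data).
    if any(call is None for call in calls):
        return False
    # Remove SNPs where 5% or more of the accessions are heterozygous.
    het_count = sum(1 for call in calls if call == "het")
    return het_count / len(calls) < max_het_fraction


# Hypothetical site genotyped in 40 accessions with a single heterozygote.
site = ["hom_ref"] * 30 + ["hom_alt"] * 9 + ["het"]
passes = keep_snp(site)
```

The heterozygosity filter relies on the expectation that self-pollinating accessions are largely homozygous, so sites with many heterozygous calls are more likely to reflect mapping artifacts, for example collapsed paralogs, than true variation.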
Genomic comparisons, including calculations of synonymous substitutions per synonymous site (Ks) and homology searches for syntenic gene sets with Hordeum vulgare L. (CoGe genome id52970), Oryza sativa L. (CoGe genome id34910), Zea mays L. (CoGe genome id33766), and B. distachyon (CoGe genome id52735), were accomplished using the DAGchainer output file from the CoGe (https://genomevolution.org/coge/) SynMap tool.
We gratefully acknowledge David A. Kudrna (Arizona Genomics Institute, Tucson, AZ, USA) and Edward Wilcox (DNA Sequencing Center, Brigham Young University, Provo, UT, USA) for their assistance and expertise with PacBio sequencing.
PJM, ENJ, and JS conceived and designed the study. RL and MC performed the experiments and managed the plant materials. PJM, RW, RJV, CB, RRR, JJ, and WAB performed the bioinformatic analyses, including de novo genomic and transcriptome assemblies, annotations, and SNP discovery. EJ, NAT, and TL contributed to the comparative genomics and candidate gene discovery for heading date and crown rust. PJM, ENJ, and JS wrote the manuscript. All authors read and approved the final manuscript.
The funding for this research was provided through a grant (#1444575) from the Plant Genome Research Program at the National Science Foundation.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 2. Oliver RE, Tinker NA, Lazo GR, Chao S, Jellen EN, Carson ML, Rines HW, Obert DE, Lutz JD, Shackelford I, et al. SNP discovery and chromosome anchoring provide the first physically-anchored hexaploid oat map and reveal synteny with model species. PLoS One. 2013;8(3):e58068.
- 11. Potter RC, Castro JM, L.C. M: Oat oil compositions with useful cosmetic and dermatological properties in. Edited by States U, vol. US5620692A. United States: GTC OATS Inc 1997.
- 16. Yan H, Bekele WA, Wight CP, Peng Y, Langdon T, Latta RG, Fu YB, Diederichsen A, Howarth CJ, Jellen EN, et al. High-density marker profiling confirms ancestral genomes of Avena species and identifies D-genome chromosomes of hexaploid oat. Theor Appl Genet. 2016;129(11):2133–49.
- 18. Aung T, Chong J, Leggett M. The transfer of crown rust resistance Pc94 from a wild diploid to cultivated hexaploid oat. In: Kema GHJ, Niks RE, Daamen RA (eds) Proc. 9th Int. Eur. Mediterr. Cereal Rusts and Powdery Mildews Conf. Lunteren Netherlands. Wageningen, European and Mediterranean Cereal Rust Foundation. 1996. pp. 167–71.
- 23. Coon MA. Characterization and variable expression of the CslF6 homologs in oat (Avena sp.). Provo: Brigham Young University; 2012.
- 25. Oliver RE, Jellen EN, Ladizinsky G, Korol AB, Kilian A, Beard JL, Dumlupinar Z, Wisniewski-Morehead NH, Svedin E, Coon M, et al. New diversity arrays technology (DArT) markers for tetraploid oat (Avena magna Murphy et Terrell) provide the first complete oat linkage map and markers linked to domestication genes from hexaploid A. sativa L. Theor Appl Genet. 2011;123(7):1159–71.
- 28. Chaffin AS, Huang YF, Smith S, Bekele WA, Babiker E, Gnanesh BN, Foresman BJ, Blanchard SG, Jay JJ, Reid RW, et al. A Consensus Map in Cultivated Hexaploid Oat Reveals Conserved Grass Synteny with Substantial Subgenome Rearrangement. Plant Genome. 2016;9(2):1–21.
- 30. Klos KE, Yimer BA, Babiker EM, Beattie AD, Bonman JM, Carson ML, Chong J, Harrison SA, Ibrahim AMH, Kolb FL, et al. Genome-Wide Association Mapping of Crown Rust Resistance in Oat Elite Germplasm. Plant Genome. 2017;10(2):1–13.
- 36. Latta RG, Bekele WA, Wight CP, Tinker NA. Comparative linkage mapping of diploid, tetraploid, and hexaploid Avena species suggests extensive chromosome rearrangement in ancestral diploids. Sci Rep. 2019; in press.
- 37. Tinker NA, Bekele WA, Hattori J. Haplotag: Software for Haplotype-Based Genotyping-by-Sequencing Analysis. G3-Genes Genom Genet. 2016;6(4):857–63.
- 60. Mizuno H, Kawahara Y, Wu JZ, Katayose Y, Kanamori H, Ikawa H, Itoh T, Sasaki T, Matsumoto T. Asymmetric distribution of gene expression in the centromeric region of rice chromosome 5. Front Plant Sci. 2011;2(16):1–12.
- 61. Philippe R, Paux E, Bertin I, Sourdille P, Choulet F, Laugier C, Simkova H, Safar J, Bellec A, Vautrin S, et al. A high density physical map of chromosome 1BL supports evolutionary studies, map-based cloning and sequencing in wheat. Genome Biol. 2013;14(6):1–22.
- 63. Melters DP, Bradnam KR, Young HA, Telis N, May MR, Ruby JG, Sebra R, Peluso P, Eid J, Rank D, et al. Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biol. 2013;14(1):1–20.
- 65. Gan X, Hay A, Kwantes M, Haberer G, Hallab A, Dello Ioio R, Hofhuis H, Pieper B, Cartolano M, Neumann U, et al. The Cardamine hirsuta genome offers insight into the evolution of morphological diversity. Nat Plants. 2016;2(11):1–6.
- 78. Schubert M, Marcussen T, Meseguer AS, Fjellheim S. The grass subfamily Pooideae: Cretaceous-Palaeocene origin and climate-driven Cenozoic diversification. Glob Ecol Biogeogr. 2019;28(8):1168–82.
- 84. Jellen EN, Gill BS, Rines HW, Fox SL, Wilson WA, McMullen MS. Translocations in current and ancestral spring and winter oat accessions. In: 1996 Agronomy abstracts, vol. 1996. Madison: Agronomy Society of America. p. 78.
- 91. Simmons MD. The Cereal Rusts Vol II: Diseases, distribution, epidemiology and control. Orlando: Academic Press; 1985.
- 92. Li PC, Quan XD, Jia GF, Xiao J, Cloutier S, You FM. RGAugury: a pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants. BMC Genomics. 2016;17(852):1–10.
- 103. Sambrook J, Fritsch EF, Maniatis T. Molecular cloning: A laboratory manual. 2nd ed. Cold Spring Harbor: Cold Spring Harbor Laboratory Press; 1989.
- 112. Smit A, Hubley R. RepeatModeler Open-1.0. 2008–2015. <http://www.repeatmasker.org>. Accessed 22 Apr 2019.
- 113. Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013–2015. <http://www.repeatmasker.org>. Accessed 22 Apr 2019.
- 115. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013; preprint arXiv:1303.3997.
- 118. Mayer KF, Waugh R, Brown JW, Schulman A, Langridge P. A physical, genetic and functional sequence assembly of the barley genome. Nature. 2012;491.
- 119. Du HL, Yu Y, Ma YF, Gao Q, Cao YH, Chen Z, Ma B, Qi M, Li Y, Zhao XF, et al. Sequencing and de novo assembly of a near complete indica rice genome. Nat Commun. 2017;8.
- 122. Maughan PJ, Lee R, Walstead RN, Vickerstaff RJ, Fogarty MC, Brouwer CR, Reid RR, Jay JJ, Bekele WA, Jackson EW, et al. Raw sequences used for A. atlantica genome assembly are deposited in the Sequence Read Archive database under the BioProject PRJNA546592. National Center for Biotechnology Information; 2019. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA546592. Accessed 5 Jun 2019.
- 123. Maughan PJ, Lee R, Walstead RN, Vickerstaff RJ, Fogarty MC, Brouwer CR, Reid RR, Jay JJ, Bekele WA, Jackson EW, et al. Raw sequences used for A. eriantha genome assembly are deposited in the Sequence Read Archive database under the BioProject PRJNA546595. National Center for Biotechnology Information; 2019. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA546595. Accessed 5 Jun 2019.
- 124. Maughan PJ, Lee R, Walstead RN, Vickerstaff RJ, Fogarty MC, Brouwer CR, Reid RR, Jay JJ, Bekele WA, Jackson EW, et al. The raw reads for the resequencing panel of the Avena diploid species are found in BioProject PRJNA556219. National Center for Biotechnology Information; 2019. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA556219. Accessed 23 Jul 2019.
- 125. Maughan PJ, Lee R, Walstead RN, Vickerstaff RJ, Fogarty MC, Brouwer CR, Reid RR, Jay JJ, Bekele WA, Jackson EW, et al. Avena atlantica genome and annotation. Comparative Genomics; 2019. https://genomevolution.org/coge/GenomeInfo.pl?gid=53337. Accessed 21 Dec 2018.
- 126. Maughan PJ, Lee R, Walstead RN, Vickerstaff RJ, Fogarty MC, Brouwer CR, Reid RR, Jay JJ, Bekele WA, Jackson EW, et al. Avena eriantha genome and annotation. Comparative Genomics; 2019. https://genomevolution.org/coge/GenomeInfo.pl?gid=53381. Accessed 2 Jan 2019.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.