Background

Oat (Avena sativa L.) is a nutritionally important crop throughout the world. It is ranked 6th in world cereal production [1], and while its primary use continues to be as a livestock feed, its uses as a human food and for cosmetics continue to increase [2]. Among the many nutritional benefits of oat are its high levels of calcium, β-glucan soluble fiber [3,4,5,6], and high-quality oil and protein [7, 8]. Oat seed contains no gluten and only low levels of gluten-related prolamins and therefore is a healthy diet alternative for many people who cannot tolerate dietary gluten. Oat is high in polyphenolic avenanthramides having antioxidant, anti-inflammatory, and antiatherogenic properties [9]. Oat also contains two classes of saponins: avenacosides (sugars bound to steroids) and avenacins (sugars bound to triterpenoid), both of which have been shown to lower cholesterol, stimulate the immune system, and have anti-carcinogenic properties [7]. Oat also has many topical uses, as it has a soothing effect on skin and has been used to treat dry, itchy skin [10]; oat has also been shown to have sun-blocking properties [11], and it is often found in products to treat eczema, psoriasis, and other skin conditions [12, 13].

Common oat (A. sativa) and red oat (A. byzantina C. Koch) are allohexaploids (2n = 6x = 42, AACCDD subgenomes) belonging to the Poeae Tribe of the Poaceae [14] and are thought to have been domesticated from wild-weedy A. sterilis L. [15], a species that arose from hybridization between a CCDD allotetraploid closely related to modern A. insularis Ladiz. and an AsAs diploid [16]. Several variants of the A-subgenome diploids exist (Ac, Ad, Al, Ap, and As [17]) and are known to harbor several genetic features of significance, including major crown rust resistance genes that have been transferred into hexaploid oat cultivars [18, 19]. The A-genome diploids have also been identified as potential gene sources for improving soluble fiber and protein [20]. The A-subgenome is also part of a major intergenomic translocation (7C-17A) in A. sativa-A. sterilis that has been associated with adaptation to winter hardiness, a key element in oat production that likely contributed to the plant’s ability to shift from Mediterranean winter ecology to Eurasian spring-summer cultivation [21].

The C-subgenome chromosomes have a high amount of diffuse heterochromatin along their entirety [22]; this is a genetic feature not seen in the A and D chromosomes, where heterochromatin is localized and seemingly concentrated around the centromeres, at the telomeres, and in flanking secondary constrictions where rRNA genes are located. Among the important genetic features in the C-subgenome is a terminal translocation segment on the long arm of 21D which carries a putative CSlF6c locus that likely has a negative effect on seed soluble fiber content [23, 24]. Linkage has also been demonstrated between the chromosome 5C telomeric knob in allotetraploid A. magna Murphy et Terrell (CCDD subgenomes) and co-segregating genes controlling awn production and basal abscission layer formation which have been implicated in the domestication of common oat [25].

Despite the historical importance of oat and the renewed interest in its nutritional value, a complete genome sequence of oat has yet to be reported. Indeed, the A. sativa genome is large (> 12 Gb [26]), duplicated, complex, highly repetitive, and characterized by several major intra- and intergenomic rearrangements—making full genome assembly of the hexaploid difficult [27]. Here we report the development of fully annotated, chromosome-scale assemblies for the extant progenitor species of the As- and Cp-subgenomes, A. atlantica B.R.Baum & Fedak and Avena eriantha Durieu., respectively. Using these assemblies, we (i) identified and quantified repetitive element content in the genome, including centromeric and telomeric repeats, (ii) analyzed syntenic relationships with other cereal grains and homoeologous relationships within oat using consensus linkage maps [28], (iii) identified putative candidate genes for flowering time [29] and crown rust resistance [30] relative to recently published genome-wide association studies (GWAS), (iv) estimated the age of the evolutionary split between the A- and C-subgenomes using synonymous substitution rates (Ks) analysis, and (v) examined the genetic diversity and phylogenetic relationship from a resequencing panel of 76 A- and C-genome Avena species.

Results

Whole-genome sequencing and assembly

We selected the A. atlantica accession Cc 7277 and the A. eriantha accession CN 19328 for whole-genome shotgun sequencing. Both accessions are highly inbred and phenotypically homogeneous and represent type accessions for their respective species. A total of 31,544,396 and 28,257,346 PacBio reads were generated across 122 (RSII and Sequel) and 54 (Sequel) SMRT cells, yielding 325.9 Gb (~84× coverage) and 276.6 Gb (~71× coverage) of sequence data for A. atlantica and A. eriantha, respectively. The longest reads for each species, 194,884 and 151,576 bp, came from the Sequel instrument. The N50 read length for A. atlantica and A. eriantha was 18,658 and 15,102 bp, respectively. In addition to PacBio sequencing, a total of 192 Gb for A. atlantica and 40 Gb for A. eriantha of 2 × 150 bp Illumina sequences were generated. A k-mer analysis (k = 21) using GenomeScope [31] predicted a genome size of 3.72 Gb with 0.07% heterozygosity and a repeat fraction of 78% for A. atlantica and a genome size of 4.17 Gb with 0.12% heterozygosity and a repeat fraction of 76% for A. eriantha. The relative magnitudes of these values agree well with those reported by Bennett and Smith [26] and Yan et al. [32], both of which report that the genomes of the A-genome diploids are ~ 15% smaller than those of the C-genome diploids. However, earlier estimates determined by replicated flow cytometry measurements ranged from 4.1 to 4.6 Gb for A-genomes and from 5.0 to 5.1 Gb for C-genomes [32]. The differences in genome size predicted by k-mer vs. flow cytometry analyses likely reflect the large repeat fraction of the oat genome, which is difficult to account for in a k-mer analysis.

Prior to Hi-C scaffolding, Canu was used to assemble the A. atlantica and A. eriantha PacBio long reads into 3914 and 8067 contigs with an N50 of 5,544,947 and 1,385,002 bp, spanning a total length of 3.68 and 3.77 Gb, respectively (Table 1). The L50 values of the assemblies were 196 and 797, and the longest contigs spanned 25,143,700 and 10,103,775 bp, respectively. The average G+C contents of the assemblies were 44.4% and 43.9%, which are similar to those of most monocotyledonous cereals (e.g., Sorghum bicolor, 43.9%; Oryza sativa, 43.6% G+C) but significantly higher than the G+C content predicted for dicots, which typically ranges between 33 and 36% (e.g., Carica papaya, 34%; Arabidopsis thaliana, 36%) [34]. As these were PacBio read-based assemblies, no “N” gaps were present in the Canu assemblies.
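As a reminder of how the contiguity statistics quoted above are defined, the following minimal Python sketch computes N50 and L50 from a list of contig lengths. The function name and the toy lengths are illustrative assumptions; this is not the code used to produce Table 1.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths in bp.

    N50 is the length of the contig at which the running total of the
    length-sorted contigs first reaches half of the assembly size;
    L50 is the number of contigs needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths supplied")


# Hypothetical toy assembly (not real Avena contigs).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_contigs))
```

Read this way, an L50 of 196 for the A. atlantica Canu assembly means that the 196 longest contigs together hold at least half of the assembled sequence.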

Table 1 Summary statistics for the Canu [33] and Hi-C assemblies for A. atlantica and A. eriantha

To improve the Canu assemblies, contigs were further scaffolded by chromatin-contact maps using DoveTail Chicago® and Hi-C libraries. Chicago® library contact maps are based on purified DNA that is reconstituted in vitro and thus limited to chromatin associations no larger than the size of the purified input DNA fragments (< 100 kb). Nonetheless, they are ideal for detecting and correcting misjoins in de novo assemblies as well as for short-range scaffolding [35]. Approximately 73× coverage of 1–100 kb read pairs (2 × 150 bp) was generated from Chicago® libraries for each Avena species and used to scaffold the Canu assemblies using the HiRiSE™ scaffolder. In total, 334 and 158 breaks were made, while 1157 and 2962 joins were made in the A. atlantica and A. eriantha assemblies, respectively. The net effect of these changes was to decrease the number of total scaffolds to 3118 in the A. atlantica assembly and to 5263 in the A. eriantha assembly, which was accompanied by a slight decrease in N50 (4310 and 1314 kb, respectively) for each assembly. Whenever a join was made between contigs, an “N” gap, consisting of 100 Ns, was created. The total percent of the genome in gaps, for both species, was less than 0.1%.

The Chicago®-based assemblies were then further scaffolded using in vivo Hi-C libraries, created from native chromatin to produce ultra-long-range mate pairs. Mate pair reads (10–10,000 kb) representing a physical coverage of 2749× and 513× were generated for the A. atlantica and A. eriantha genomes and scaffolded using the HiRiSE™ scaffolder. In total, 922 joins were made in the A. atlantica assembly and 2614 joins (plus three breaks) in the A. eriantha assembly, producing ultra-long scaffolds that putatively represent full-length chromosomes and/or chromosome arms. The HiRiSE™ assembly for A. atlantica had a scaffold N50 of 513.2 Mb and an L50 of 4, spanning a total sequence length of 3.685 Gb. The longest scaffold spanned 577.8 Mb. Similarly, the A. eriantha assembly had a scaffold N50 of 534.8 Mb, an L50 of 4, and spanned a total sequence length of 3.778 Gb, with the longest scaffold reaching 588.3 Mb. Scaffold joins produced by the Hi-C mate pairs introduced new “N” gaps in the assembly (each consisting of 1000 Ns), thereby increasing the number of gaps in the assembly to 2079 and 5576 for A. atlantica and A. eriantha, respectively. The percentage of “N” nucleotides in the final assemblies was less than 0.1%, with average gap sizes of 600 and 578 bp, respectively (Table 1).

The longest eight scaffolds of the A. atlantica assembly, presumably representing two chromosome arms (205 and 278 Mb) and six full-length chromosomes (448–577 Mb), comprised > 96% of the total sequence length from the Canu assembly. Similarly, the longest seven scaffolds, ranging in size from 455 to 588 Mb and presumably representing each of the seven haploid A. eriantha chromosomes, comprised > 97% of the total Canu assembly sequence. For simplicity, scaffolds representing each of the seven chromosomes from each species are henceforth referred to by size (longest to shortest) as AA1-AA7 and AE1-AE7. The scaffolds in the A. atlantica and A. eriantha assemblies that remain unintegrated into one of the chromosome-scale pseudomolecules are relatively small and repetitive, with an average size of 61 and 38 kb. These properties likely contributed to the inability of the proximity-guided assembler to confidently place these contigs within the framework of the chromosomes, specifically because of the low number of interactions on a short fragment and the inability to discern interaction distance differences over a short molecule.

Chromosome arm merging

We compared the A. atlantica assembly with a recently published genetic linkage map, constructed from an F6:8 recombinant inbred line population generated from a cross of A. strigosa × A. wiestii, both AsAs Avena diploid species [36]. This map was based on 11,455 ordered, codominant 64-base tag-level haplotypes on seven linkage groups generated using the Haplotag pipeline [37]. Of these, 4551 haplotypes had perfect matches to single sites on the eight largest scaffolds. A clear one-to-one correspondence between linkage groups (LG) and physical assembly scaffolds was observed (Fig. 1), with greater than 97% of the tag-level haplotypes mapping to a specific scaffold derived from a single LG. For example, of the 846 tag-level haplotypes mapping to scaffold ScoFOjO_324_449 (AA1), 838 (> 99%) were derived from LG 7 (Table 2). Of the 464 tag-level haplotypes derived from LG 2, 378 mapped to scaffold ScoFOjO_1310 (278 Mb) and 85 mapped to scaffold ScoFOjO_1577 (205 Mb), indicating that these two smaller scaffolds should be merged to produce a single, full-length pseudochromosome (AA5; 485 Mb), thus completing the assembly of seven full-length haploid chromosomes for A. atlantica. A head-to-tail merging of these chromosome arms (separated by 1000 Ns) was determined based on the collinearity of the tag-level haplotypes with respect to their orientation within the linkage group. A near perfect collinear relationship was observed between the linkage map and the physical map for all chromosome-linkage group comparisons, with the exceptions being the anticipated reductions of linkage distances relative to physical distances observed at the pericentromeric regions of each chromosome (Fig. 1). It is well documented that recombination is suppressed in centromeres at rates ranging from fivefold to greater than 200-fold, depending on the species [38, 39]. Of the 2188 contigs that were unintegrated into an A. atlantica chromosome using the Hi-C data, we identified segregating haplotypes linked to 22, spanning a total length of 1.07 Mb, which could tentatively place them into the context of the seven haploid chromosomes based on their linkage position (Fig. 1).
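The scaffold-to-linkage-group assignment described above amounts to counting, for each scaffold, which linkage group contributes most of its uniquely mapped markers. The Python sketch below illustrates that tally under stated assumptions: the function name and input layout are hypothetical, and the eight non-LG 7 markers on AA1 are lumped under a placeholder label because the text does not say which linkage groups they came from.

```python
from collections import Counter


def assign_linkage_groups(hits):
    """Assign each scaffold to the linkage group supplying most of its markers.

    hits: iterable of (scaffold, linkage_group) pairs, one per marker with a
          perfect, single-site match.
    Returns {scaffold: (best_linkage_group, fraction_of_markers)}.
    """
    per_scaffold = {}
    for scaffold, lg in hits:
        per_scaffold.setdefault(scaffold, Counter())[lg] += 1

    assignment = {}
    for scaffold, counts in per_scaffold.items():
        best_lg, best_n = counts.most_common(1)[0]
        assignment[scaffold] = (best_lg, best_n / sum(counts.values()))
    return assignment


# AA1 counts from the text: 838 of 846 markers come from LG 7.
# "other" is a placeholder for the remaining 8 markers.
hits = [("AA1", "LG7")] * 838 + [("AA1", "other")] * 8
best_lg, fraction = assign_linkage_groups(hits)["AA1"]
print(best_lg, round(fraction, 3))
```

A scaffold pair such as ScoFOjO_1310 and ScoFOjO_1577 would show up in this tally as two scaffolds assigned to the same linkage group, which is the signal that prompted the merge.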

Fig. 1

Correlation between the physical and linkage map. The genetic position of mapped markers is plotted as a function of physical distance relative to the A. atlantica genome assembly. The linkage position of six unassigned scaffolds with multiple mapping markers is shown

Table 2 Physical map and linkage map assignment. Haplotag markers from the consensus map of Latta et al. [36] were used to assign scaffold assemblies to linkage groups. Two scaffolds mapped to LG 2 and were merged

Analysis of repetitive elements

The repeat fraction of the Avena genome assemblies was identified and annotated using RepeatModeler and RepeatMasker. In total, ~ 83% of each genome was classified as repetitive, with the most commonly identified repetitive elements being classified as long terminal repeat retrotransposons (LTR-RTs); LTR-RTs are the most abundant genomic components in flowering plants [40, 41], and their abundance is strongly correlated with genome size [42]. Within published plant genomes, repeat content varies widely, ranging from 3% for the minute 82 Mb genome of Utricularia gibba L. [43] to 85% for maize [44]. Given the large size of these genomes, it is not surprising that < 20% of the genome is classified as non-repetitive.

Of the various LTR-RT present (Additional file 1: Table S1), Gypsy-like and Copia-like elements represent > 60% of each genome, in a ratio of 2.3:1 and 3.5:1 for the A. atlantica and A. eriantha genomes, respectively, which is similar to the ratios reported for other Poaceae species (e.g., rice, 4.9:1 [45]; sorghum, 3.7:1 [46]; and maize, 1.6:1, [47]). The next most common element was class II CMC-EnSpm DNA transposons, representing ~ 5% of each genome—which are known common features of the cereals [48, 49]. Interestingly, a significant proportion (A. atlantica: 10.6% and A. eriantha: 14.3%) of the interspersed repeat fraction for each genome was classified as “unknown”. Given the extensive investigations of repeat elements in the grasses [50,51,52], this unknown fraction likely represents repeat elements unique to Avena and could be invaluable in differentiating the A-, C-, and D-subgenomes in hexaploid oat. For example, Solano et al. [53] reported the identification of a tandem repeat sequence clone (pAm1; GenBank X83958) from Avena murphyi L., an AACC tetraploid, which selectively hybridized to the C-subgenome. A repeat that was highly homologous (E-value 2E-82) to pAm1 was identified by RepeatModeler in A. eriantha, but is missing in the A. atlantica genome (Fig. 2a; Tracks 4 and 5). Similarly, Katsiotis et al. [54] reported the identification of an interspersed repeat (pAvKB26; GenBank AJ297385.1) that selectively hybridized to only the A- and D-subgenomes. This repeat was identified in the unknown repeat fraction of A. atlantica but was missing in the A. eriantha genome (Fig. 2b; Tracks 4 and 5). Repeat content is believed to be an important driver of genome organization and evolution [55] and these data will be important for understanding the overall evolution of common hexaploid oat.

Fig. 2

Genome overview of a A. atlantica and b A. eriantha. Track 1 (outside): Chromosome and sizes; Track 2: Annotated gene density; Track 3: Centromeric repeat density; Track 4: Telomeric sub-repeat density; Track 5: C-genome specific repeat (pAm1) density; Track 6: A-genome specific repeat (pAvKB26) density

In addition to the interspersed repeat elements, ~ 0.5% of the genome was classified as low complexity, satellite, microsatellite or telomeric repeat (see genomic feature section below). Indeed, 5217 and 3404 putative microsatellite loci were identified, with the most common di-, tri- and tetranucleotide repeat motif identified being (AT)n, (AAC)n or (GGC)n and (TTTA)n, in A. atlantica and A. eriantha, respectively. To date, no microsatellites have been generated specifically for the Avena diploid species – thus these new putative microsatellite loci represent important genetic tools for studying diversity and specifically for advancing breeding in the A-genome diploids.

Transcriptome assembly and functional annotation

The A. atlantica and A. eriantha transcriptomes, which consisted of 51,223 and 47,361 scaffolded isoforms, the Brachypodium cDNA and peptide models (v 1.0; Ensembl genomes) and the uniprot-sprot database were provided as primary evidence for annotation in the MAKER pipeline [56]. The RNA-Seq data mapped with high efficiency to the assemblies, with > 96% of the reads mapping to their respective genome at 93.1% concordance for pair alignment rates, suggestive of high-quality genome assemblies for both species. The MAKER pipeline identified a total of 51,100 and 49,105 gene predictions, with mean transcript lengths of 3018 and 3153 bp, and with 70% and 66% of the annotations having annotation edit distance (AED) measures < 0.25, for A. atlantica and A. eriantha genomes, respectively. AED integrates sensitivity, specificity, and accuracy measurement to calculate annotation quality, where AED values < 0.25 are indicative of high-quality annotations [57]. The mean G+C content of the transcripts in both species was ~ 52%. The increase in G+C content within coding regions relative to the overall G+C content of the genome (~ 44%) is a well-known phenomenon and is hypothesized to be the result of GC-biased gene conversion – a process by which the G+C content of DNA increases due to gene conversion during recombination [58].

The completeness of the gene space was quantified using BUSCO which provides a quantitative measure for genome and transcriptome completeness based on a core set of highly conserved plant-specific single-copy orthologs [59]. Of the 1440 plant-specific orthologs, 1387 (96.3%) were identified in the A. atlantica genome assembly as full length, while 1395 (96.9%) were identified in the A. eriantha assembly as full length, suggesting high-quality and complete genome assemblies. As expected for diploid species, the level of gene duplication, as identified by BUSCO for the conserved orthologous genes was low for both species (2.2% and 2.3%). Similarly, a BUSCO analysis of the transcript and protein annotation sets produced by MAKER identified similar numbers of conserved orthologs for both species, which is indicative of a successful annotation process.

Genomic features

Pericentromeric regions, associated with reduced recombination relative to physical distance, were evident from the linkage and physical map comparison in A. atlantica (Fig. 1). The observation that gene density is substantially reduced provided further evidence that these regions represented centromeric regions in both species, as has been well-documented previously in other eukaryotes [60, 61]. Centromeres in most plant species are complex but are dominated by megabase-sized arrays of tandemly arranged monomeric satellite repeats. While complex and highly diverse among plant species, they commonly share a unit length ranging between 150 and 180 bp, which is close to the size of the nucleosome unit [62]. Melters et al. [63] showed that due to the relative size of the centromere, the most common repeat found in whole-genome sequencing data is the putative centromeric repeat. Using the output of RepeatModeler from A. eriantha, we identified a high-copy-number 159 bp tandem repeat that aligned specifically with the putative centromere location in each of the A. atlantica and A. eriantha chromosomes (Fig. 2; Additional file 2). Although the 159 bp repeat is similar in size to the putative centromeric repeats found in other grass species (e.g., B. distachyon, 156 bp; H. vulgare, 139 bp; Oryza brachyantha A.Chev. & Roehr., 154 bp; Z. mays, 156 bp), not surprisingly it shares little sequence similarity with the centromeric repeats of those species. Indeed, centromeric repeats exhibit little to no evidence of sequence similarity beyond ~ 50 million years of divergence [63]. As has been documented in other plant species, these putative centromeric repeats span a large portion (often > 50 Mb) of the A. atlantica and A. eriantha chromosomes, suggesting the presence of large pericentromeric heterochromatic regions [64, 65].
Moreover, the positioning of the centromeres, as defined by this putative repeat and the gene density plots, is consistent with the cytological positioning of the centromere, which suggests that the centromeres in A. atlantica are almost all metacentric to submetacentric, while the centromeres in A. eriantha are almost all sub-telocentric [66, 67]. Indeed, per our analyses, we identified three metacentric, two submetacentric and two sub-telocentric chromosomes in A. atlantica, and five sub-telocentric, one submetacentric and one metacentric chromosome in A. eriantha (Fig. 2).

RepeatModeler annotated a putative telomeric satellite sequence (665 bp) for A. eriantha (Additional file 1: Table S1). A homologous sequence (639 bp) with significant homology (E-value = 0.0) and alignment identity (Identity = 80%; Gap = 6%) was found among the A. atlantica repeat sequences identified by RepeatModeler (Additional file 2). BLAST searches of the assemblies with their respective satellite telomeric repeat sequence identified enriched regions of the telomeric repeat on all seven of the chromosomes for each species (Fig. 2). In A. atlantica, telomeric satellite sequences were located toward the distal end of each chromosome; however, in A. eriantha the location of the sequence is more dispersed, being found primarily at the end of the chromosomes on ten of the 14 chromosome arms, while in a few instances being interspersed interstitially. Interstitial telomere-like repeats have been reported in several plant species including Anthurium, Vicia, Sideritis, Typhonium, and Pinus where they were implicated in chromosome rearrangements, including inversions, translocations, and chromosome fusions [68,69,70,71]. While chromosomal rearrangements are common in Avena, we caution that the very repetitive nature of telomeric sequences makes them susceptible to collapse during the assembly process and thus inherently difficult to order and orient in the Hi-C scaffolding process.

SNP discovery and genetic diversity

To characterize the diversity and phylogenetic relationships among Avena A- and C-genome diploid species, we resequenced at 10× coverage 61 A-genome diploid accessions (including A. atlantica, A. brevis, A. canariensis, A. damascena, A. hirtula, A. longiglumis, A. lusitanica, A. strigosa, A. strigosa-brevis, A. strigosa-hispanica, A. strigosa-nuda, and A. wiestii) and 10 C-genome diploids (A. clauda, A. eriantha, A. ventricosa; Additional file 3: Table S2). The resequencing produced 40 Gb of sequence data per accession (Additional file 3: Table S2). The A-genome accession reads were mapped against the A. atlantica genome, while the C-genome species were mapped against the A. eriantha genome for SNP discovery using InterSnp [72]. InterSnp uses BAM files to call SNPs between samples based on consensus alleles at each genomic position, filtered to produce a dataset with 0% missing data across all lines. Considering the cleistogamous nature of the accessions included, any SNPs with > 5% heterozygous calls were deemed likely to result from spurious read-mapping and were removed from the dataset. Using a minor allele frequency threshold of 0.1, a total of 286,567 and 3,185,959 putative SNPs were identified within the A-genome and C-genome diploid datasets, respectively, and used by SNPhylo [73] to investigate phylogenetic relationships. SNPhylo reduces oversampling effects at linked SNPs using an LD threshold (0.1) with a sliding window of 500,000 base pairs. Thus, a total of 7221 and 11,530 SNPs, with an average of 1032 and 1647 SNPs per chromosome, were selected prior to tree construction for the A-genome and C-genome diploid datasets, respectively (Additional file 4: Table S3).
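The three site-level filters described above (no missing data, at most 5% heterozygous calls, and an allele frequency cutoff of 0.1) can be expressed as a single predicate per SNP. The sketch below is an illustration only: it is not InterSnp or SNPhylo code, the 0/1/2 genotype coding is an assumption, and the allele frequency cutoff is interpreted here as a lower bound on minor allele frequency.

```python
def keep_snp(genotypes, max_het=0.05, min_maf=0.1):
    """Decide whether a biallelic SNP passes the three filters.

    genotypes: one call per accession, coded 0 (homozygous reference),
               1 (heterozygous), 2 (homozygous alternate) or None (missing).
               This coding is an assumption made for the example.
    """
    if not genotypes or any(g is None for g in genotypes):
        return False  # require 0% missing data across all lines

    n = len(genotypes)
    het_fraction = sum(1 for g in genotypes if g == 1) / n
    if het_fraction > max_het:
        return False  # likely spurious read mapping in a selfing species

    alt_freq = sum(genotypes) / (2 * n)      # alternate allele frequency
    maf = min(alt_freq, 1 - alt_freq)        # minor allele frequency
    return maf >= min_maf


# Hypothetical sites scored in ten accessions.
print(keep_snp([0] * 8 + [2] * 2))     # passes all three filters
print(keep_snp([0] * 8 + [1] * 2))     # rejected: 20% heterozygous calls
print(keep_snp([0] * 9 + [None]))      # rejected: missing data
```

The linkage disequilibrium pruning that SNPhylo applies afterwards operates on pairs of sites within a window and is not covered by this per-site predicate.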

The bootstrapped maximum likelihood phylogenetic trees were rooted with either the A. eriantha accession CN 19328 for the A-genome accessions tree (Fig. 5a) or with the A. atlantica accession Cc 7277 for the C-genome accessions tree (Fig. 5b). The A-genome diploids formed two distinct clades: one of these consisted primarily of accessions classified in taxa having the AsAs subgenome, which had previously been described by Rajhathy and Morrison [74] and Leggett [75] as including A. atlantica, A. hirtula, the domesticated forms of A. strigosa, and A. wiestii; and a second clade comprised mostly of A. canariensis (AcAc), Syrian accessions of A. damascena (AdAd), A. longiglumis (AlAl), and three floret-shattering accessions that were possibly misidentified as A. hirtula and A. lusitanica. As expected, the spikelet-shattering A. atlantica occupied the basal position on the AsAs branch of the tree, while all of the A. strigosa (domesticated AsAs) genotypes formed a clade at the top of the tree and included a single accession of weedy A. wiestii (CIav 1994) that, upon inspection of the panicles, more closely resembled a long-awned, semi-shattering A. strigosa genotype.

The A. strigosa branch shows clearly the effect of a domestication bottleneck. This branch of the tree is subdivided into two distinct sub-branches. The upper sub-branch consists predominantly of genotypes of Iberian origin (i.e., CN 25698, CIav 9019, CIav 9036, etc.) and includes seven homogeneous accessions that are derivatives of the Brazilian ‘Saia’ variety of forage oat (i.e., CIav 7010, PI 291990, etc.). Interestingly, the A. hispanica genotypes form a unitary subclade within this branch that is strongly supported by the bootstrap value. The lower sub-branch is comprised of accessions from outside the Iberian Peninsula (PI 83721, PI 287314, PI 304557, CIav 9022, etc.) and includes all of the A. strigosa-nuda varieties. Since A. brevis strains are distributed in both branches, there is no evidence to confirm its identity as a distinct taxon within or apart from A. strigosa. The presence of branches containing multiple, genetically indistinct accessions indicates there is a high degree of duplication being curated within the USDA and PGR-Canada gene banks for A. strigosa.

The remainder of the A-genome tree consists of entirely wild genotypes. Avena lusitanica is not a universally accepted taxon, and the presence of these strains on various branches of the tree confirms that this is not a valid independent taxonomic entity; instead, it is part of the floret-dispersing A. hirtula-wiestii complex of semi-desert and Mediterranean scrub ecotypes of the AsAs biological species complex. The presence of three floret-shattering accessions from Morocco that were previously identified as A. damascena (AdAd) on the AsAs branch (PI 657468, PI 657471, PI 657472) and the two Syrian A. damascena genotypes on the other branch (CN 19457, CN 19459) suggests that the Moroccan group are misidentified and are therefore, like A. lusitanica, either members of the AsAs A. hirtula-wiestii-atlantica-strigosa complex or, possibly, misclassified accessions of tetraploid A. barbata.

The rooted C-genome tree had the lone A. ventricosa (CvCv) accession at the base of the C-genome branch that consisted of two subclades. The more basal branch consisted of accessions of spikelet-shattering A. eriantha (CpCp) from Algeria (CN 24022) and four samples of A. eriantha from an extended population growing in the Middle Atlas Mountains of Morocco (PI 657575–8). The other branch included Algerian (CN 24040) and Turkish (CN 19238) accessions of floret-shattering A. clauda (CpCp) along with Iranian (CN 19256) and Algerian (CN 19328, the reference genome) A. eriantha genotypes.

Discussion

Comparative genomics

The age of the Poaceae family has been difficult to establish, with varying ages reported in the literature [76, 77]. Schubert et al. [78] recently reported the use of newly available paleobotanical fossils to establish the age of the family to be approximately 120.8 million years ago (Ma), with the split of the Aveneae, Brachypodieae, and BOP clades occurring approximately 44.3, 51.8, and 80.2 Ma, respectively, suggesting that the grasses have a lower nucleotide substitution rate than the other angiosperms [79]. We calculated the rate of synonymous nucleotide substitutions per synonymous site (Ks) for orthologous gene pairs between the A. atlantica and A. eriantha assemblies using the CodeML [80] tool on the CoGe platform (genomevolution.org/coge). A total of 18,002 duplicate gene pairs were identified with a clear peak seen at Ks = 0.0875 (Additional file 5: Figure S1). From the node estimates reported by Schubert et al., we calculated an average substitution rate for the Pooideae lineage of 3.39E-09, suggesting that speciation between A. atlantica and A. eriantha occurred between 5.4 and 12.9 Ma, depending on whether a core eukaryotic-based synonymous mutation rate or the calculated lineage-specific rate for Pooideae was used in the calculation [81, 82]. As seen in the SynMap dotplot alignment (Additional file 6: Figure S2), significant synteny was observed between the Avena chromosomes, consisting of 187 syntenic blocks with 21,021 collinear gene pairs (112 genes/block) with 98.2% coverage across both the A. atlantica and A. eriantha genomes. As expected, given the relatively close ancestry of the species, the size (bp) of the syntenic blocks between species was highly correlated (R² = 0.88; Additional file 6: Figure S2C). The large blocks of syntenic genes are suggestive of orthologous relationships between the chromosomes of the species (Additional file 6: Figure S2A).
For example, slightly more than 77% (349 Mb) of the syntenic sequence found on AA2 is derived from AE5, suggesting that they are orthologs. Indeed, using a majority rule (> 50% syntenic sequence) we identified the following orthologous chromosome pairs: AA1 + AE6 (61%; 248 Mb); AA2 + AE5 (77%; 349 Mb); AA3 + AE3 (74%; 318 Mb); AA4 + AE1 (71%; 271 Mb); AA7 + AE2 (57%; 274 Mb); with AA5 and AA6 sharing orthology with several A. eriantha chromosomes (Additional file 6: Figure S2B).
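The upper bound of the divergence window quoted above follows from the standard relationship T = Ks / (2r), where r is the synonymous substitution rate per site per year and the factor of two accounts for both lineages accumulating substitutions. The Python sketch below reproduces that arithmetic; the function name is hypothetical, and the assumption that this formula underlies the reported dates is ours, made because it recovers the 12.9 Ma figure from the stated peak and rate.

```python
def divergence_time_ma(ks_peak, rate_per_site_per_year):
    """Convert a Ks peak into a divergence time in millions of years (Ma).

    Uses T = Ks / (2 * r): each of the two diverging lineages accumulates
    synonymous substitutions at rate r, so they separate at 2r per year.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    years = ks_peak / (2 * rate_per_site_per_year)
    return years / 1e6


# Values from the text: Ks peak of 0.0875 and the Pooideae rate of 3.39E-09.
print(round(divergence_time_ma(0.0875, 3.39e-9), 1))
```

Because the rate sits in the denominator, a faster assumed rate, such as a core eukaryotic rate, gives a proportionally younger estimate, which is why the reported window spans more than a twofold range.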

The Poaceae family consists of many agronomically important species, commonly referred to as cereals, that are found in three subfamilies: Oryzoideae (rice), Panicoideae (maize, sorghum) and Pooideae (wheat, barley, oat and rye). Pooideae comprises 14 tribes, including the tribes Brachypodieae, Poeae (syn Aveneae, including oat) and Triticeae (barley, rye, and wheat), with Poeae and Triticeae tribes having separated ~ 49 Ma [78]. This agrees well with the Ks analyses presented here for A. atlantica and A. eriantha and with the published Hordeum vulgare genome [83], which both show a clear peak at 0.3 – suggestive of a divergence time of 44 Ma (per the calculated lineage-specific rate for Pooideae). As expected, the Ks analyses from the Avena comparisons with the B. distachyon genome (International Brachypodium Initiative, 2010) suggested a more distant divergence of 47–51 Ma for the split of the Avena–Brachypodium lineages (Additional file 5: Figure S1).

SynMap was also used to investigate syntenic relationships between the Avena and Hordeum chromosomes (Additional file 7: Figure S3 and Additional file 8: Figure S4). Although more syntenic blocks (719 and 714) were identified in the Avena–Hordeum comparisons, they were smaller, consisting of ~ 8.5 genes/block, and were accompanied by a lower syntenic block size correlation (R2 = 0.35 and 0.41; Additional file 7: Figure S3C and Additional file 8: Figure S4C). The decrease in block size and correlation reflects the more distant evolutionary relationship between the species. Nonetheless, the shared ancestry between the two Pooideae species was evident from the substantial synteny observed across all seven Avena–Hordeum chromosome comparisons (Additional file 7: Figure S3A and Additional file 8: Figure S4A). As expected, large, proximal, non-syntenic regions were observed in regions corresponding to putative centromeres, where gene density is substantially reduced [60, 61]. The synteny observed among the Avena and Hordeum chromosomes suggests several orthologous relationships. For example, Hordeum chromosome 1H is clearly orthologous with Avena chromosomes AA2 and AE5. Indeed, of the syntenic sequence on 1H, 99% (116 Mb) was syntenic to AA2 and 85% (105 Mb) was syntenic to AE5, which is not surprising since AA2 and AE5 were shown above to be orthologs. Using a simple majority rule (> 50% syntenic sequence), the following are putative Hordeum–Avena orthologs: 1H + AA2/AE5; 2H + AA5/AE4; 3H + AA3/AE3; 6H + AA7/AE2; and 7H + AA1/AE6. The specific A. atlantica orthologs of 4H and 5H are likely AA6 and AA4, respectively; however, rearrangements obscure the likely orthologs for A. eriantha (Additional file 7: Figure S3B and Additional file 8: Figure S4B).

Bekele et al. [29] recently published a high-density, tag-level haplotype linkage map of hexaploid oat (A. sativa). This consensus linkage map increased the marker density of the former consensus map [28] and consists of 21 well-formed linkage groups, putatively corresponding to each of the 21 hexaploid oat chromosomes. To identify the ancestral subgenome groups (A-, C-, and D-) for each of the 21 linkage groups, we mapped the haplotag markers to both the A. atlantica and A. eriantha genomes. To avoid false hits, which are particularly problematic due to the highly repetitive nature of the oat genomes, only BLAST hits with perfect identity across the entirety of the marker sequence (i.e., zero gap openings and mismatches) were retained for downstream analyses. In total, 2119 and 969 tag-level haplotypes were mapped to the A. atlantica and A. eriantha genomes, respectively. The increased number (~2X) of tag-level haplotypes mapping successfully against the A. atlantica genome was expected since many D-subgenome haplotypes would map against the A-genome diploid, given the close phylogenetic relationship of these two subgenomes [16]. Indeed, close inspection of the mapping showed that in nearly all cases, tag-level haplotypes mapping to a specific A. atlantica chromosome were derived from two separate consensus linkage groups, presumably corresponding to homoeologs derived from the A- and D-subgenomes. For example, 322 tag-level haplotypes mapped to chromosome AA1, with 153 (48%) derived from linkage group Mrg12 and 111 (35%) derived from linkage group Mrg02, which were previously identified as being derived from the A- and D-subgenomes, respectively [16] (Table 3A). Homoeologous chromosome pairs between the A- and D-subgenomes included: Mrg33/Mrg08, Mrg18/Mrg01, Mrg05/Mrg04, Mrg24/Mrg06, Mrg23/Mrg11, Mrg12/Mrg02, and Mrg20/Mrg21. Similar mapping of the tag-level haplotypes against the A. eriantha genome identified linkage groups Mrg13, Mrg03, Mrg15, Mrg17, Mrg19, Mrg09, and Mrg11 as being derived from the C-subgenome (Table 3B). Interestingly, Mrg18, which we designated above as an A-subgenome-derived linkage group, also showed substantial mapping to the C-genome chromosome AE7, suggesting that Mrg18 is actually derived from an A-subgenome/C-subgenome (A/C) intergenomic reciprocal translocation. This is a well-documented reciprocal translocation, first reported by Jellen et al. [84], where it was identified as 7C-17A. Other identifiable rearrangements include D-subgenome and C-subgenome (D/C) intergenomic exchanges on Mrg06/Mrg13, Mrg08/Mrg03, and Mrg19/Mrg28 (Table 3).
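The perfect-match filter described above could be applied to tabular BLAST output along the following lines. This is a sketch, not the authors' script: it assumes the default 12-column BLAST+ tabular format (-outfmt 6) and a separately supplied table of marker lengths, and the marker names and hits in the example are invented.

```python
import csv
import io
from typing import Dict, Iterator

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def perfect_hits(handle, marker_lengths: Dict[str, int]) -> Iterator[Dict[str, str]]:
    """Yield hits that align the full marker with no mismatches or gap openings."""
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, fields))
        # The alignment must span the entire marker sequence.
        full_length = int(hit["length"]) == marker_lengths.get(hit["qseqid"], -1)
        if (full_length
                and int(hit["mismatch"]) == 0
                and int(hit["gapopen"]) == 0
                and float(hit["pident"]) == 100.0):
            yield hit


# Hypothetical hits: tagA is perfect, tagB has a mismatch, tagC is partial.
example = (
    "tagA\tAA1\t100.000\t64\t0\t0\t1\t64\t1000\t1063\t1e-28\t119\n"
    "tagB\tAA1\t98.438\t64\t1\t0\t1\t64\t5000\t5063\t1e-25\t110\n"
    "tagC\tAA2\t100.000\t50\t0\t0\t1\t50\t7000\t7049\t1e-20\t93\n"
)
lengths = {"tagA": 64, "tagB": 64, "tagC": 64}
kept = [h["qseqid"] for h in perfect_hits(io.StringIO(example), lengths)]
print(kept)  # ['tagA']
```

Requiring the alignment length to equal the marker length matters because a short local alignment can report 100% identity while covering only part of the tag.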

Table 3 Ancestral subgenome group (A-, C-, and D-) designations for each of the 21 consensus linkage groups reported for A. sativa [29]. Haplotag markers mapping to (A) A. atlantica and (B) A. eriantha chromosomes; cells with the highest numbers of mapped haplotags are colored red, transitioning to white as the number of mapped haplotags decreases

Utility of the genome assemblies

Given the genetic complexity of polyploid species, diploid species have frequently been used as simplified genetic models [85,86,87]. We show the value of these diploid assemblies using published genome-wide association studies (GWAS) for heading date and crown rust resistance, both major breeding targets for common oat. Heading date (flowering time) is critically important for regional adaptation, photosynthetic efficiency, and stress avoidance, and through these factors it strongly influences overall yield [88]. A haplotag-based GWAS of heading date in the CORE diversity panel (n = 635) of common hexaploid (AACCDD) oat identified two major associations: on linkage group Mrg02 at position 34 cM in eight field trials and on Mrg12 at position 40–42 cM in seven field trials [29]. Interestingly, our comparative analysis (see above) suggests that Mrg02 and Mrg12 are homoeologous (Table 3A), with Mrg12 and Mrg02 being derived from the A-subgenome and D-subgenome, respectively. BLAST searches against the A. atlantica genome assembly using the marker sequences associated with heading date on Mrg12 localized the heading date quantitative trait locus (QTL) to chromosome AA1, at an interval spanning bases 548,905,448–553,755,648. A total of 175 annotated gene sequences are found within this region, including a likely candidate gene near the center of this interval, AA006173 (Fig. 3a; 550,704,569–550,704,964), which is annotated as being homologous to HD3A (Heading Date 3A) from O. sativa, and is homologous (E-value = 9e-125, Identity = 97%) to the flowering time protein (FT-like; AAZ38709.1). Yan et al. [89] described HD3A as the vernalization gene, VRN3, in wheat and barley. Interestingly, while Mrg02 is likely of D-subgenome origin, BLAST searches of markers associated with heading date from the Mrg02 linkage group mapped significantly to the A-genome chromosome AA1 at an interval spanning 550,053,072–550,947,435 bp, only 242,471 bp from the aforementioned HD3A gene, suggesting that the candidate genes for both major flowering-time QTLs are functional homoeologs of the flowering time (FT) HD3A gene in the A- and D-subgenomes (Fig. 3b).
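The positional relationships among the two QTL intervals and the candidate gene can be checked with simple interval arithmetic. The sketch below uses the AA1 coordinates given in the text and assumes that the candidate gene ends at 550,704,964, the value consistent with the reported distance of 242,471 bp. The helper name is hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end) in bp, inclusive


def contains(outer: Interval, inner: Interval) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Coordinates on chromosome AA1, as reported in the text.
MRG12_QTL: Interval = (548_905_448, 553_755_648)
MRG02_QTL: Interval = (550_053_072, 550_947_435)
# End coordinate assumed to be 550,704,964 (see lead-in).
HD3A: Interval = (550_704_569, 550_704_964)

print(contains(MRG12_QTL, HD3A))       # True
print(contains(MRG02_QTL, HD3A))       # True
print(contains(MRG12_QTL, MRG02_QTL))  # True
# Distance from the end of the Mrg02 interval to the end of the gene.
print(MRG02_QTL[1] - HD3A[1])          # 242471
```

With these coordinates, the Mrg02 interval is nested inside the Mrg12 interval and both contain the candidate gene.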

Fig. 3

Identification of candidate genes putatively underlying heading date in oats. Candidate gene loci were identified using BLAST searches against the A. atlantica genome assembly using marker sequences associated with heading date QTLs located on the homoeologous linkage groups (a) Mrg12 and (b) Mrg02 (Bekele et al. [29]). Markers from both QTLs mapped to the same physical position on chromosome AA1, within an interval containing an FT-like protein (HD3A), suggesting that heading date in modern oat is controlled by two functional homoeologs of the flowering time gene

Crown rust, caused by Puccinia coronata f. sp. avenae, is the most damaging and widespread disease of oat worldwide [90]. Moderate to severe outbreaks can reduce yield by 10–40% [91]. Klos et al. [30] performed a GWAS of crown rust resistance on elite common oat accessions challenged with multiple P. coronata isolates and identified multiple QTLs on 12 linkage groups that were associated with crown rust resistance, several of which were associated with known resistance genes (e.g., Pc91). Resistance gene analogs (RGAs) contain specific conserved domains and motifs that can be used to identify and classify R-genes into four main RGA families: NBS-encoding proteins, receptor-like protein kinases (RLKs), receptor-like proteins (RLPs), and transmembrane coiled-coil proteins (TMCC). The RGAugury pipeline [92] annotated a total of 1563 (511 NBS; 722 RLK; 120 RLP; 160 TMCC) and 1402 (459 NBS; 654 RLK; 135 RLP; 154 TMCC) RGAs in the A. atlantica and A. eriantha genomes, respectively (Additional file 9: Table S4; Additional file 10: Figure S5). As has been observed in other monocots, no Toll/Interleukin-1 receptor-NBS-LRR R-genes were predicted in either genome, supporting the hypothesis that this class of R-gene evolved in the eudicot lineage [93] or was lost during the evolution of the monocots [94]. The RGAs, specifically the NBS-encoding RGAs, cluster primarily in subtelomeric regions (Additional file 10: Figure S5), with clusters identified on almost all chromosomes and often correlated with the mapping position of crown rust QTLs. For example, the Pc91 gene, a known seedling resistance gene previously associated with QTL QPc.CORE.18.3 [30], maps, via two SNPs (GMI_ES03_c2277_336 and GMI_ES05_c11155_383), to the A. atlantica chromosome AA2 at positions 510,519,361 and 533,475,317, co-locating with a predicted disease gene cluster (Fig. 4). The closest annotated disease resistance genes to these markers are AA013376, similar to RPH8A (a nonfunctional homolog of rpp8 in Arabidopsis [95]), located at position 510,828,316, and AA014151, similar to RPM1 (a well-documented resistance gene in Arabidopsis [96]), located at 533,698,614 (Additional file 11: Table S5). Both candidate RGAs were identified by the RGAugury pipeline as CC-NBS-LRR-containing R-genes. We caution that while these two candidate RGAs are positioned in the immediate vicinity of the associated markers, at least 87 RGAs are present in the RGA cluster defined by the QTL. We note that the diagnostic SCAR and DART markers developed for Pc91 also map to this same location (527,126,948) [97, 98].
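The marker-to-gene distances implied by these coordinates can be tabulated with a short nearest-gene lookup. The sketch below is illustrative only: it includes just the two candidate genes named in the text rather than the full set of RGAs in the cluster, and it assumes that the two marker positions are listed in the same order as the two SNP names.

```python
from typing import Dict, Tuple


def nearest_gene(marker_pos: int, genes: Dict[str, int]) -> Tuple[str, int]:
    """Return the gene closest to a marker and the absolute distance in bp."""
    name = min(genes, key=lambda g: abs(genes[g] - marker_pos))
    return name, abs(genes[name] - marker_pos)


# Gene positions on AA2 as reported in the text (two candidates only).
genes = {"AA013376": 510_828_316, "AA014151": 533_698_614}
# Marker order relative to the positions is an assumption.
markers = {"GMI_ES03_c2277_336": 510_519_361,
           "GMI_ES05_c11155_383": 533_475_317}

for marker, pos in markers.items():
    gene, distance = nearest_gene(pos, genes)
    print(marker, gene, distance)
```

With the full annotation, the same lookup would need every RGA in the interval, since the text cautions that at least 87 RGAs fall within the cluster.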

Fig. 4

Identification of candidate genes putatively underlying crown rust resistance in oats. Candidate gene loci were identified using BLAST searches against the A. atlantica genome assembly using marker sequences associated with crown rust QTLs located on hexaploid A. sativa linkage group Mrg18 reported by Klos et al. [30]. Mrg18 was previously shown to be involved in an intergenomic translocation involving 7C and 17A, corresponding to A. eriantha chromosome AE7 (blue) and A. atlantica chromosome AA2 (red)

Conclusions

Reference-quality, de novo whole-genome sequence assemblies for two highly repetitive ~ 4 Gb Avena diploid species were produced using a hybrid approach involving PacBio long reads, Illumina short reads, and both in vitro and in vivo chromatin-contact mapping. The whole-genome reference assemblies for As- and Cp-genome oat diploids provided for the first time in this paper represent powerful tools for identifying genes that underlie adaptive, disease resistance, and grain-quality traits critical for oat improvement. The utility of these whole-genome references was demonstrated first by analyzing sequences homologous to heading-date QTL-containing regions that were previously identified via GWAS in common hexaploid oat (A. sativa) to find linked candidate genes in A. atlantica and A. eriantha. Additionally, we used these references in successfully identifying RGAs homologous to oat crown rust resistance genes.

Avena atlantica retains a remarkable degree of synteny with barley, whereas A. eriantha has undergone a relatively greater degree of chromosomal rearrangement, suggesting an underlying genomic instability in the C-genome diploids. This might be related to the abundant heterochromatin, including the underlying pAm1 repeat motif, distributed throughout the chromosomes of this genome (Fig. 2b, Track 5) [27]. These genome sequences provide considerable insight into the complex evolutionary processes that have led to the appearance of cultivated diploid, tetraploid, and hexaploid oat over millions of years. These processes included responses to natural selective events such as the Zanclean Cataclysm ~ 5 Ma and repeated cycles of global climate change characterized by boreal glacial maxima interspersed with humid periods and desertification due to northerly expansions of the Saharan and Arabian Deserts [99,100,101].

We demonstrate that A. atlantica, A. strigosa, and A. wiestii represent multiple ecotypes or subspecies of a single biological species complex sharing the subgenome designation AsAs and distinguished primarily by their seed-dispersal strategies. The phylogeny presented here, which was generated by analyzing thousands of SNPs identified via resequencing of dozens of geographically diverse accessions, clearly illustrates a monophyletic relationship with A. atlantica accessions at the root of the AsAs-genome clade (Fig. 5a). This is further seen in the high degree of synteny and collinearity between the A. atlantica chromosomes and A. strigosa X wiestii linkage groups reported by Kremer et al. [102] (Additional file 12: Figure S6). This result is remarkable, given the high degree of chromosomal rearrangement previously observed among different species and genomes within Avena [27].

Fig. 5

Abbreviated maximum likelihood trees generated using (a) 10,894 SNPs for C-genome diploids rooted to the A. atlantica (AT_Cc7277) reference and (b) 7221 SNPs for A-genome diploids rooted to the A. eriantha reference (ER_CN 19238). Asterisks denote percentage of 1000 bootstrap replicates that support the topology at 90–100% (gold) and 75–89% (blue). Scale bar represents substitutions per site. Branch labels are based on subgenome composition and, in some cases, diaspore morphology (“floret-shattering,” “spikelet-shattering,” or “cultivated”). Unabbreviated trees are provided as Additional file 15: Figure S7 and Additional file 16: Figure S8

The oat community has struggled without a reference genome for decades. Finally, we have complete references for what are, essentially, all three component genomes of cultivated hexaploid oat and the four known subgenomes of the genus, given the close correspondence between Avena subgenomes A, B, and D. Moreover, once a complete hexaploid reference is available, the utility of these component genomes will increase further, as they will provide a precise roadmap of the structural and functional evolutionary steps that took place in the formation of this unique and important polyploid species.

Methods

Plant material and nucleic acid extraction

For whole-genome assembly, young leaf tissue (~ 14–21 days post emergence), dark treated for 72 h, from A. atlantica (CC7277; T. Langdon, Aberystwyth University, Wales, UK) and A. eriantha (BYU132; EN Jellen, Brigham Young University, Provo, UT) was flash-frozen and sent to the Arizona Genomics Institute (AGI; Tucson, AZ, USA) for high molecular weight DNA extraction. For the diversity panel, DNA from 76 accessions of diploid A- and C-subgenome species (Additional file 3: Table S2) was extracted from 30 mg of freeze-dried leaf tissue using a protocol devised by Sambrook et al. [103] with modifications described by Todd and Vodkin [104]. All plants were grown in greenhouses at Brigham Young University (BYU) using Sunshine Mix II (Sun Gro, Bellevue, WA, USA) supplemented with Osmocote fertilizers (Scotts, Marysville, OH, USA) and maintained at 25 °C under broad-spectrum halogen lamps, with 12-h photoperiods.

DNA sequencing and read processing

For whole-genome sequencing, large-insert SMRTBell libraries (> 20 kb), selected using a BluePippin System (Sage Science, Inc., Beverly, MA, USA), were prepared according to the manufacturer's standard protocols. Libraries were sequenced using P6-C4 chemistry on either the RS II or Sequel sequencing instruments (Pacific BioSciences, Menlo Park, CA, USA; Additional file 13: Table S6). Sequencing was performed for A. atlantica at the DNA Sequencing Center (DNASC) at BYU (Provo, UT, USA) and at RTL Genomics (Lubbock, TX), while the sequencing for A. eriantha was performed at the BYU DNASC. For the diversity panel and for whole-genome polishing, extracted DNA was sent to the Beijing Genomics Institute (BGI; Hong Kong, China) for 2 × 150 bp paired-end (PE) sequencing from standard 500-bp insert libraries. Trimmomatic v0.35 [105] was used to remove adapter sequences, to trim leading and trailing bases with a quality score below 20, and to clip reads where the average per-base quality fell below 20 over a four-nucleotide sliding window. After trimming, any reads shorter than 75 nucleotides in length were removed.
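The quality-trimming criteria can be illustrated with a simplified sketch operating on a list of per-base Phred scores. This is not Trimmomatic's implementation and omits adapter removal; it only mirrors the thresholds stated above (quality 20, four-base window, 75-nucleotide minimum), and the function name and example read are hypothetical.

```python
from typing import List, Optional, Tuple


def trim_read(quals: List[int], min_q: int = 20, window: int = 4,
              min_len: int = 75) -> Optional[Tuple[int, int]]:
    """Return the half-open (start, end) span to keep, or None if the read is dropped."""
    start, end = 0, len(quals)
    # Clip leading and trailing bases whose quality is below the threshold.
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    # Scan 5' to 3' and cut at the first window whose mean quality is too low.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_q:
            end = i
            break
    # Discard reads that end up shorter than the minimum length.
    return (start, end) if end - start >= min_len else None


# Hypothetical 99-base read with poor ends and a low-quality stretch near the 3' end.
quals = [10] * 3 + [35] * 80 + [5] * 4 + [35] * 10 + [12] * 2
print(trim_read(quals))      # (3, 82)
print(trim_read([35] * 50))  # None: shorter than 75 bases
```

In this simplified version, the cut is placed at the start of the first failing window, so one high-quality base preceding the low-quality stretch is also discarded.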

RNA-Seq and transcriptome assembly

For A. atlantica, RNA-Seq data consisted of 2 × 100 bp PE Illumina reads derived from 11 different plant tissue types, including stem, mature leaf, stressed mature leaf, seed (2 days old), hypocotyl (4–5 days old), root (4–5 days old), vegetative meristem, green grain, yellow grain, young flower (meiotic), and green anthers (Additional file 14: Table S7). For A. eriantha, 2 × 150 bp PE RNA-Seq data were generated by BGI from six tissue sources, including young leaf, mature leaf, crown, roots, immature panicle, and whole seedling, harvested from plants grown hydroponically in 1× Maxigro™ (GH Inc., Sebastopol, CA, USA) in growth chambers maintained at 21 °C under broad-spectrum halogen lamps, with a 12-h photoperiod at BYU (Additional file 14: Table S7). The resulting reads were trimmed using Trimmomatic [105], then aligned to either the A. atlantica or A. eriantha reference using HiSat2 v2.0.4 [106] with default parameters and max intron length set to 50,000 bp. Following alignment, the resulting SAM file was sorted and indexed using SAMtools v1.6 [107] and assembled into putative transcripts using StringTie v1.3.4 [108]. The completeness of the assembled transcriptome was assessed using BLAST comparisons to the reference Brachypodium distachyon L. peptide set (ftp://ftp.ensemblgenomes.org/pub/plants/release-37/fasta/brachypodium_distachyon/pep/).

Genome size, assembly, polishing, and scaffolding

Genome size was estimated using Jellyfish [109] and GenomeScope v1.0 [31] at k-mer length = 21 for each species. Initial assemblies were performed using Canu v1.7 [33] with default parameters (corMhapSensitivity = normal and corOutCoverage = 40). The resulting assemblies were polished using Arrow from the GenomicConsensus package in the Pacific BioSciences SMRT portal v5.1.0 and PILON v0.22 [110] using Illumina short reads. Chicago® and Hi-C proximity-guided assemblies were performed by Dovetail Genomics LLC (Santa Cruz, CA, USA) to produce chromosome-scale assemblies. Fresh leaf tissue from a single dark-treated (72 h), 3-week-old plant, derived directly from selfing of the original A. atlantica and A. eriantha plants, was sent to Dovetail Genomics for Chicago® and Hi-C library preparation as described by Putnam et al. [35] and Lieberman-Aiden et al. [111], respectively, using the DpnII restriction endonuclease. The libraries were prepared using a standard Illumina library prep and sequenced on an Illumina HiSeq X in rapid run mode. The HiRiSE™ scaffolder and the Chicago® and Hi-C library-based read pairs were used to produce a likelihood model for genomic distance between read pairs, which was used to break putative misjoins and to identify and make prospective joins in the de novo Canu assemblies.
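As a rough illustration of how a k-mer histogram relates to genome size, the sketch below divides the total number of non-error k-mers by the depth of the main coverage peak. This back-of-the-envelope estimate is a simplification and is not the statistical model fitted by GenomeScope; the error cutoff and the toy histogram are invented for illustration.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int],
                         min_multiplicity: int = 5) -> float:
    """Estimate genome size (bp) as total k-mers divided by the peak depth.

    `histogram` maps k-mer multiplicity to the number of distinct k-mers
    observed at that multiplicity. K-mers below `min_multiplicity` are
    treated as sequencing errors and ignored (an assumed cutoff).
    """
    solid = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers at or above the error cutoff")
    # Depth at which the most distinct k-mers occur (the main peak).
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences contributed by the retained k-mers.
    total_kmers = sum(m * n for m, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram: an error spike at low multiplicity and a peak at depth 20.
toy = {1: 5000, 2: 800, 18: 900, 19: 1500, 20: 2000, 21: 1500, 22: 900}
print(estimate_genome_size(toy))  # 6800.0
```

For highly repetitive genomes such as these, a model-based approach is preferable because repeats and heterozygosity distort the simple single-peak assumption.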

Repeat analysis, genome completeness, and annotation

RepeatModeler v1.0.11 [112] and RepeatMasker v4.0.7 [113] were used to quantify and classify repetitive elements in the final assemblies, relative to RepBase libraries (v20181026; www.girinst.org). Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2 [59] was employed to assess the completeness of the assembly using the Embryophyta odb9 dataset and the “--long” argument, which applies Augustus [114] optimization for self-training.

MAKER2 v2.31.10 [56, 57] was used to annotate the final genomes. Expressed sequence tag evidence for annotation included the de novo transcriptomes for each species as well as the cDNA models from Brachypodium distachyon L. (v 1.0; Ensembl genomes). Protein evidence included the uniprot-sprot database (downloaded September 25, 2018) as well as the peptide models from B. distachyon (v 1.0; Ensembl genomes). Repeats were masked based on species-specific files produced by RepeatModeler. For ab initio gene prediction, A. atlantica and A. eriantha species-specific AUGUSTUS gene prediction models were provided as well as rice (Oryza sativa)-based SNAP models.

Variant identification and tree creation

Single nucleotide polymorphisms (SNPs) for the diversity panel were identified from the Illumina reads by mapping the A-subgenome and C-subgenome diploid accessions against the A. atlantica and A. eriantha reference genome assemblies, respectively, using BWA-mem v0.7.17 [115]. Output SAM files were converted to BAM files and sorted using SAMtools v1.6 [107], and indexed using Sambamba v0.6.8 [116]. InterSnp, an analysis tool from the BamBam v1.4 package [72], was used to call SNPs with the arguments -m 2 and -f 0.35. Bash scripting was used to remove SNPs with less than 100% genotype calls across all accessions (i.e., no missing data was allowed) and, given the cleistogamous nature of the species, SNPs for which 5% or more of the accessions were called as heterozygotes. SNPs on unscaffolded contigs were also removed prior to phylogenetic analysis. SNPhylo v20160204 [73], which uses MUSCLE [117] for sequence alignments and linkage disequilibrium to downsample the SNP dataset, was used to build phylogenies with the bootstrapping parameter set to 1000. The resulting trees were visualized using FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/figtree).
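The two SNP filters (no missing calls and fewer than 5% heterozygous accessions) can be sketched as follows. The genotype encoding ("A/A", "A/G", "./." for missing) is a hypothetical representation chosen for illustration and is not the InterSnp output format; the authors applied the filters with Bash scripting.

```python
from typing import List

MISSING = "./."  # hypothetical code for a missing genotype call


def keep_snp(calls: List[str], max_het_fraction: float = 0.05) -> bool:
    """Keep a SNP only if every accession has a call and fewer than
    `max_het_fraction` of accessions are heterozygous."""
    if not calls or any(c == MISSING for c in calls):
        return False  # require 100% genotype calls
    # A call is heterozygous when its two alleles differ.
    het = sum(1 for c in calls if len(set(c.split("/"))) > 1)
    return het / len(calls) < max_het_fraction


# Hypothetical panels of 40 accessions.
all_hom = ["A/A"] * 25 + ["G/G"] * 15
one_het = ["A/A"] * 24 + ["A/G"] + ["G/G"] * 15       # 2.5% heterozygous
two_het = ["A/A"] * 23 + ["A/G"] * 2 + ["G/G"] * 15   # exactly 5% heterozygous
one_missing = ["A/A"] * 24 + [MISSING] + ["G/G"] * 15

print(keep_snp(all_hom), keep_snp(one_het), keep_snp(two_het), keep_snp(one_missing))
```

Because SNPs with 5% or more heterozygous calls are removed, the threshold comparison is strict: a SNP at exactly 5% is discarded.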

Genome comparison

Genomic comparisons, including calculations of synonymous substitutions per synonymous site (Ks) and homology searches for syntenic gene sets with Hordeum vulgare L. (CoGe genome id52970 [118]), Oryza sativa L. (CoGe genome id34910 [119]), Zea mays L. (CoGe genome id33766 [120]), and B. distachyon (CoGe genome id52735 [121]), were accomplished using the DAGchainer output file from the CoGe (https://genomevolution.org/coge/) SynMap tool.