Introduction

The monocot order Poales comprises 16 families and approximately 18,000 species (sensu APG II 2003), and relationships among families are generally well-resolved and supported (Chase 2004; Graham et al. 2006). The largest family within the order, Poaceae (the grasses), has been the focus of many biological studies due to its ecological, economic, and evolutionary importance. Poaceae include species that are the primary source of nutrition for humans and grazing animals, e.g., wheat (Triticum aestivum), maize (Zea mays), rice (Oryza sativa), perennial ryegrass (Lolium perenne), oats (Avena sativa), sorghum (Sorghum bicolor), and barley (Hordeum vulgare). Grasses have also received much attention as sources of biofuels (Carpita and McCann 2008; Rubin 2008). Furthermore, there is considerable interest in using plastid genetic engineering in the grasses for crop improvement and for producing biopharmaceuticals and vaccines (Verma and Daniell 2007).

With its relatively small nuclear genome size and a high degree of gene synteny with other major cereal grasses, rice is commonly used as the model monocot plant system. The first draft of the rice nuclear genome (Indica group) was made available just two years after the completion of the Arabidopsis genome (Yu et al. 2002). Sequencing of six additional grass nuclear genomes is in progress: Brachypodium, maize, foxtail millet (Setaria italica), sorghum, switchgrass (Panicum virgatum), and wheat (Garvin et al. 2008; Rubin 2008). In addition, plastid genome sequences are currently available for 13 grass genera; Brassicaceae is the only family that is as densely sampled (13 total). Unlike the highly conserved sequences of Brassicaceae, Poaceae plastid genomes have experienced several evolutionary phenomena, including accelerated rates of sequence evolution, gene and intron loss, and genomic rearrangements. For these reasons, Poaceae provide an excellent system to examine plastid genome evolution.

The plastid genomes of land plants are generally highly conserved in terms of gene content, order, and organization (Bock 2007; Palmer 1991; Raubeson and Jansen 2005). The genome is circular with a quadripartite structure composed of two copies of a large inverted repeat (IR) separated by large and small single-copy regions (LSC and SSC, respectively). These genomes usually range in size from 100 to 200 kb and contain 100–130 different genes. The majority of the genes, approximately 80, encode proteins involved in photosynthesis and gene expression and the remaining code for tRNAs and rRNAs. While rates of nucleotide substitutions are low in plastid genomes relative to nuclear genomes, a few lineages have experienced rate acceleration. Plastid genomes in the flowering plant family Geraniaceae exhibit extreme rate heterogeneity, and ribosomal protein, RNA polymerase, and ATPase genes were shown to evolve more rapidly than photosynthetic genes (Guisinger et al. 2008). Aside from this recent example, the first and best documented example of rate heterogeneity among photosynthetic angiosperm plastid genomes occurs for grass lineages (Gaut et al. 1993). Notably, the long-branch leading to the grasses has been shown to impede phylogenetic inference (Soltis and Soltis 2004; Stefanovic et al. 2004), although Leebens-Mack et al. (2005) improved relationship resolution among angiosperms using increased taxon sampling. This and other studies emphasize the importance of taxon sampling in phylogenetic and comparative genomic studies in order to accurately infer molecular evolutionary relationships, rates, and patterns (reviewed in Heath et al. 2008).

Previous studies used methods that could not capture the full extent of rate acceleration and genome evolution in grasses: relative rate tests restricted each comparison to three taxa (Gaut et al. 1993; Muse and Gaut 1994), or non-Poales sequences were used as outgroups (Bortiri et al. 2008; Chang et al. 2006; Matsuoka et al. 2002). More comprehensive analyses using additional Poales plastid genome sequences are needed to better understand the patterns and causes of genome evolution in this group. There are three major goals in the current study. First, we present the complete plastid genome sequence of Typha latifolia L. (Typhaceae), the first non-grass Poales sequenced to date. Second, we characterize Poales genome organization and evolution using nine fully annotated grass plastid genomes. Third, we examine rates and patterns of sequence evolution within and between grasses relative to other monocot and angiosperm plastid genomes. Jansen et al. (2007) described a positive correlation between genomic changes (gene/intron loss and gene order changes) and lineage-specific branch length. In the current study, we use a genome-wide approach to test the degree and nature of rate acceleration, and we specifically examine genomic rearrangements and substitution patterns for the branch leading to the grasses.

Materials and Methods

DNA Source, Plastid Isolation, Genome Amplification, and Sequencing

Leaf material of T. latifolia was field collected in Arizona (R.C. Haberle 188, Arizona, Yavapai Co., TEX). Plastids were isolated from 21 g of fresh leaves using the sucrose-gradient method (Palmer 1986), as modified by Jansen et al. (2005). The isolated plastids were then lysed, and the entire plastid genome was amplified by rolling circle amplification (RCA), using the REPLI-g™ whole genome amplification kit (Qiagen Inc., Valencia, CA, USA) following the methods outlined in Jansen et al. (2005). The RCA product was then digested with the restriction enzymes EcoRI and BstBI, and the resulting fragments were separated in a 1% agarose gel to determine the quality of plastid DNA. The RCA product was sheared by serial passage through a narrow aperture using a Hydroshear device (Gene Machines, San Carlos, CA, USA), and the resulting fragments were enzymatically repaired to blunt ends, gel purified, and ligated into pUC18 plasmids. The clones were introduced into Escherichia coli by electroporation, plated onto nutrient agar with antibiotic selection, and grown overnight. Colonies were randomly selected and robotically processed through RCA of plasmid clones, sequencing reactions using BigDye chemistry (Applied Biosystems, Foster City, CA, USA), reaction cleanup using solid-phase reversible immobilization, and sequencing on an ABI 3730 XL automated DNA sequencer. Detailed protocols are available at http://www.jgi.doe.gov/sequencing/index.html.

Genome Assembly and Annotation

Sequences from randomly chosen clones were processed using PHRED and assembled based on overlapping sequence into a draft genome sequence using PHRAP (Ewing and Green 1998; Ewing et al. 1998). Quality of the sequence and assembly was verified using Consed (Gordon et al. 1998). In most regions of the genome, we had 6–12-fold coverage, but there were a few areas with gaps or low depth of coverage. PCR and sequencing at the University of Texas at Austin were used to bridge gaps and fill in areas of low coverage in the genome. Additional sequences were added until a completely contiguous consensus was created representing the entire plastid genome with a minimum of 2X coverage and a consensus quality score of Q40 or greater. The genome was annotated using DOGMA (Dual Organellar GenoMe Annotator, http://dogma.ccbb.utexas.edu; Wyman et al. 2004).
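
The Q40 threshold above can be made concrete with the standard Phred relationship between a quality score Q and the estimated per-base error probability, P = 10^(-Q/10). The following is a minimal Python sketch of that conversion; the function names are illustrative and are not part of PHRED, PHRAP, or Consed.

```python
import math


def phred_to_error_probability(q):
    """Estimated per-base error probability for Phred quality score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score corresponding to error probability p."""
    return -10 * math.log10(p)


# Q40 corresponds to an estimated error rate of about 1 in 10,000 bases.
q40_error = phred_to_error_probability(40)
print(f"Q40 error probability: {q40_error:.6f}")
```

Under this relationship, a consensus of Q40 or greater implies an expected error rate of no more than about one base in 10,000.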

Comparisons of Gene Content and Gene Order

Gene content comparisons were performed using Multipipmaker (Schwartz et al. 2003). Comparisons involved 10 Poales genomes, including Typha latifolia (current study) and nine grasses: Agrostis stolonifera (NC_008591), Brachypodium distachyon (NC_011032), Hordeum vulgare subsp. vulgare (NC_008590), Lolium perenne (NC_009950), Oryza sativa (NC_001320), Saccharum officinarum (NC_006084), Sorghum bicolor (NC_008602), Triticum aestivum (NC_002762), and Zea mays (NC_001666). T. latifolia was used as the reference genome by including an exon file in the analysis. Gene orders were examined by pair-wise comparisons between all 10 genomes using PipMaker (Elnitski et al. 2002).

Genome and Gene Sampling

Phylogenetic and evolutionary rate comparisons were performed for a total of 47 taxa (Supplementary Table S1), including nine grasses, one other member of the Poales (T. latifolia, current study), and representatives from all major angiosperm clades. The plastid genome of O. sativa Indica group is not completely annotated and was not included in our analyses. In addition, the genomes of Festuca (NC_011713), Coix (NC_013273), and two recently published bamboos (Wu et al. 2009) were not included, because they were not publicly available at the time our comparisons were performed. Our analyses included six non-Poales monocot genome sequences. However, we chose not to include Phalaenopsis, because we wanted to include as many protein-coding genes in our analyses as possible, and all 11 ndh-genes have been lost from this genome (Chang et al. 2006). Protein-coding sequences for 73 genes were used with several exclusions (Supplementary Table S2).

Phylogenetic Analyses

Amino acid sequences were aligned using Multiple Sequence Web viewer and Alignment Tool (MSWAT, http://mswat.ccbb.utexas.edu) and manually adjusted, and the amino acid alignment was used to constrain the nucleotide alignment. Maximum parsimony (MP) and maximum likelihood (ML) analyses were performed using PAUP* version 4.0b10 (Swofford 2003) and GARLI version 0.942 (Zwickl 2006), respectively. MP analyses were performed with 100 random addition replicates and TBR branch swapping with the Multrees option. Non-parametric bootstrap analyses were performed for 100 replicates with 1 random addition replicate and TBR branch swapping with the Multrees option. Four independent ML analyses were performed using GARLI under the default settings, and bootstrap values were generated for 100 replicates under the default settings. Likelihood scores were obtained from PAUP*, because it is better at optimizing branch lengths on the final topology (Zwickl 2006).

Evolutionary Rate Estimation

The program codeml implemented in the software package PAML (Yang 2007) was used to estimate dN, dS, and dN/dS. The ML tree generated above was used as a constraint tree, but branch lengths were generated in PAML. Control files were used with the following settings: CodonFreq = 2 (codon frequency model F3x4), NSsites = 0 (no variation among sites for ω), cleandata = 1 (exclude gapped regions), fix_kappa = 0 (kappa to be estimated), and fix_omega = 0 (omega to be estimated). Using the method of Yang (1998), values of dN, dS, and dN/dS were generated. The null model (H0), where dN/dS was averaged and fixed across all taxa, was compared to two alternative models (H1 and H2). The H1 model allowed for two values of dN/dS across the tree: (1) dN/dS for the lineage leading to the grasses and (2) dN/dS for all other taxa. The H2 model allowed for three values of dN/dS across the tree: (1) dN/dS for the lineage leading to the grasses, (2) dN/dS for all other monocot branches, and (3) dN/dS for all other angiosperms. Likelihood ratio tests were used to test the fit of alternative models and model improvement, and correction for multiple comparisons used Holm's method (i.e., sequential Bonferroni correction; Holm 1979) and the false discovery rate method (Benjamini and Hochberg 1995). Gene groups were categorized according to gene function or subunits that form a functional complex; values were combined for atp-, ndh-, pet-, psa-, psb-, rpl-, rps-, and rpo-genes, respectively, according to previous studies (Chang et al. 2006; Guisinger et al. 2008; Matsuoka et al. 2002). Statistical analyses were conducted using the R software package (http://www.r-project.org).
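
The model comparison and multiple-testing steps described above can be illustrated with a short Python sketch. This is not the R code used in the study. It assumes that the likelihood ratio statistic 2(lnL_alt − lnL_null) is compared to a chi-square distribution whose degrees of freedom equal the number of additional dN/dS parameters (1 for H1 vs. H0, 2 for H2 vs. H0). All function names and input values are hypothetical.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt, df):
    """Chi-square tail probability for the statistic 2*(lnl_alt - lnl_null).

    Closed forms are used, so only df = 1 or df = 2 is supported here.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("closed form implemented only for df = 1 or 2")


def holm(pvals):
    """Holm (sequential Bonferroni) adjusted P-values, in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Smallest P is multiplied by m, the next by m - 1, and so on.
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg false discovery rate adjusted P-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of this P-value in ascending order
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted


# Hypothetical log-likelihoods for three genes (H0 vs. H1, one extra parameter).
lnl_h0 = [-1000.0, -2500.0, -800.0]
lnl_h1 = [-995.0, -2499.5, -797.0]
raw = [lrt_pvalue(a, b, df=1) for a, b in zip(lnl_h0, lnl_h1)]
print(holm(raw))
print(benjamini_hochberg(raw))
```

Holm's method controls the family-wise error rate and is the more conservative of the two corrections; the Benjamini and Hochberg procedure controls the expected proportion of false positives among the genes reported as significant.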

Results

Size, Gene Content, and Organization of the Typha Plastid Genome

The complete T. latifolia plastid genome is 161,572 base pairs (bp) in length (Supplementary Fig. S1, Table 1, GenBank accession number GU195652). Each IR is 26,390 bp, and the two IR copies are separated by a LSC region of 89,140 bp and a SSC region of 19,652 bp. There are 131 predicted coding regions, 113 of which are different, and 18 that are duplicated in the IR. The coding regions include 79 protein-coding genes, 30 tRNAs, and 4 rRNAs. The T. latifolia plastid genome has 57.1% coding sequence and a 33.8% GC content. Eighteen genes contain introns, including 12 protein-coding genes and 6 tRNAs. The IRs on the LSC boundaries include the duplication of trnH-gug and rps19 and extend 99 bp into the intergenic spacer regions between rps19 and psbA on the IRa/LSC boundary and rps19 and rpl22 on the IRb/LSC boundary (Fig. 1, Supplementary Fig. S1).
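
The region lengths and gene counts reported above can be checked against each other. The following Python sketch uses only the values stated in this section and assumes the usual quadripartite accounting, in which the genome length equals LSC + SSC + 2 × IR; the variable and function names are illustrative.

```python
def quadripartite_length(lsc, ssc, ir):
    """Total length of a plastid genome with two identical IR copies."""
    return lsc + ssc + 2 * ir


# Values reported for Typha latifolia in the text (bp).
typha = {"LSC": 89_140, "SSC": 19_652, "IR": 26_390}
total = quadripartite_length(typha["LSC"], typha["SSC"], typha["IR"])

# Gene tallies reported in the text.
unique_genes = 79 + 30 + 4          # protein-coding + tRNA + rRNA
all_coding_regions = unique_genes + 18  # plus copies duplicated in the IR

print(total)               # 161572
print(unique_genes)        # 113
print(all_coding_regions)  # 131
```

The sums reproduce the reported genome length of 161,572 bp, the 113 different genes, and the 131 predicted coding regions.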

Table 1 Comparison of major features of Typha and nine grass plastid genomes
Fig. 1 Extent of the inverted repeat (IR) in 10 Poales plastid genomes. Selected genes or portions of genes are indicated by gray boxes above or below the genome. Gene and IR lengths are not to scale (see Table 1 for Poales IR lengths)

Comparisons of Genome Organization of Poales

The complete plastid genome sequences for nine genera of grasses (Asano et al. 2004; Bortiri et al. 2008; Hiratsuka et al. 1989; Maier et al. 1995; Ogihara et al. 2000; Saski et al. 2007) and T. latifolia (current study) enable a comparison of genome organization for two families of Poales (Poaceae and Typhaceae, Table 1). Gene and intron content among all 10 genomes are highly conserved, with five differences between the grasses and T. latifolia. Relative to the early diverging Poales T. latifolia, all grasses have lost introns in clpP and rpoC1, as well as the three genes accD, ycf1, and ycf2 (Fig. 2a). In the case of gene losses, there has been a progressive degradation of the gene sequences, because differing lengths of residual sequence remain in several taxa (see arrows in Fig. 2a; Table 2; Supplementary Fig. S2). This is especially evident for ycf2, where the first 200 bp of the gene is present and conserved in all of the grasses, and additional remnants of the last 2,300 bp remain in various grasses. The length of residual ycf2 sequence varies from 698 bp in Oryza to 2,089 bp in Zea. The remnant sequences are conserved with pair-wise divergence ranging from 15.0 to 19.1% relative to T. latifolia. In the case of ycf1, the first 250 bp and the final 700 bp are present and conserved across the grasses. Unlike ycf2, the length of residual ycf1 sequence is conserved, ranging from 837 to 867 bp but with higher levels of sequence divergence (23.4–25.6%). The pattern for accD is quite different, because there are few if any small remnants of this gene in grass plastid genomes, and those that do remain are more divergent (Table 2).

Fig. 2 Multipip analyses (Schwartz et al. 2003) showing overall sequence similarity of plastid genomes based on complete genome alignment. Levels of sequence similarity are indicated by black (75–100%), gray (50–75%), and white (<50%). a Comparison of 10 members of Poales, using Typha latifolia as the reference genome. Arrows indicate gene/intron losses and deletions; partial duplication of ycf1 is due to IR expansion. b Comparison of nine Poaceae genomes using Hordeum vulgare as the reference genome. Arrows indicate deletions; 995 bp deletion is present twice because it is in the IR

Table 2 Variation in accD, ycf1, and ycf2 in Poales (length in bp/percent divergence relative to Typha)

Gene order between the grasses and T. latifolia differs as a result of three inversions of 28, 6, and <1 kb (Fig. 3a). These inversions have been known for 20 years from both gene mapping and genome sequencing (Doyle et al. 1992; Hiratsuka et al. 1989; Howe et al. 1988; Katayama and Ogihara 1993), and the two larger inversions overlap, making it possible to determine that the 28 kb inversion occurred prior to the 6 kb inversion (Doyle et al. 1992; Hiratsuka et al. 1989). Within the grasses, gene order is identical in all sequenced plastid genomes (Fig. 3b).

Fig. 3 Percent identity plot (Elnitski et al. 2002). a Typha latifolia compared to Hordeum vulgare. Numbers along the x-axis indicate the coordinates for Typha and along the y-axis for Hordeum. INV inversion. b Hordeum vulgare compared to Zea mays. Numbers along the x-axis indicate the coordinates for Hordeum and along the y-axis for Zea

Alignment of the complete plastid genomes indicates that there is a high level of sequence divergence between T. latifolia and the nine grasses (Fig. 2a). Most of the divergent regions (shown in gray or white in Fig. 2a) represent the intron and gene losses, and intergenic regions are the least conserved, containing a few highly divergent regions with large indels (shown in gray or white in Fig. 2a). Sequence conservation among the nine grasses is much higher, and again intergenic spacer regions are the most divergent and contain a few large indels (shown in gray or white in Fig. 2b).

The IR in plastid genomes has four boundaries, IRb/LSC, IRb/SSC, IRa/LSC, and IRa/SSC, and there is variation in the extent of duplication of sequences at each of these boundaries in the Poales. All members of the order have expanded the IRb/LSC and IRa/LSC boundaries to add both trnH-gug and rps19 to the IR (Fig. 1). However, the extent of IR expansion into the intergenic spacer regions between rps19 and psbA and between rps19 and rpl22 varies from 34 to 99 bp among members of the Poales (Fig. 1). At the IR/SSC boundary, the grasses have expanded the IR to duplicate rps15, but expansion beyond rps15 varies within the family. In six of the nine genera, IRa has expanded to duplicate 173–209 bp of ndhH, whereas in the three genera Saccharum, Sorghum, and Zea, IRb has expanded to duplicate 29 bp of ndhF (Fig. 1).

Phylogenetic Relationships

Phylogenetic analyses were performed on an aligned data matrix that included 47 taxa of angiosperms and 73 protein-coding genes. The total length of the aligned data set was 57,603 nucleotides and the Nexus file is available at http://www.biosci.utexas.edu/IB/faculty/jansen/lab/research/data_files/JME-Poales.nex.htm. The MP analysis generated one most parsimonious tree with a length of 104,284, a consistency index (excluding uninformative characters) of 0.36, and a retention index of 0.62. The ML analysis resulted in a tree with –lnL = 568622.59691. The ML and MP trees were largely congruent with each other and with recent phylogenetic analyses based on complete plastid genomes (Jansen et al. 2007; Moore et al. 2007). The only topological differences occurred in the eurosid clade, and these were relatively minor (see inset in Fig. 4). There was strong bootstrap support in all but one node. Further description of results will be limited to monocots and especially Poales. There is strong support for the monophyly of monocots (100% bootstrap values in both ML and MP trees). Acorus is the earliest diverging lineage followed by Lemna, Dioscorea, Yucca, Elaeis, Musa, and finally the Poales, represented by T. latifolia (Typhaceae) and the nine genera of grasses (Poaceae). All of these monocot nodes have bootstrap values >95% and most are 100% in both ML and MP trees. Within grasses all nodes except for one have strong bootstrap support of 100% in both ML and MP trees. The nine genera represent 3 of the 12 recognized subfamilies (sensu GPWG 2001) of grasses (Ehrhartoideae, Panicoideae, and Pooideae), and the monophyly of each is strongly supported. The Ehrhartoideae are sister to the Pooideae, although support for this relationship is weak (54%) in the ML tree.

Fig. 4 ML tree of 47 taxa for 73 protein-coding genes (−lnL = 568622.59691). MP analysis was generally congruent, but topological differences are shown in the inset. Bootstrap values are shown at nodes for ML/MP, and only one statistic is reported where values are the same, except in the eurosid clade, where ML values are shown on the full tree and MP values are on the inset. The Poales clade is shaded and genomic changes within Poales are indicated by black bars. Subfamilies sampled are shown (EHR Ehrhartoideae, POO Pooideae, PAN Panicoideae)

The distribution of 10 plastid structural rearrangements in the Poaceae is plotted on the ML tree (Fig. 4). Eight of these rearrangements (intron losses from two genes, three gene losses, and three inversions) occur on the branch leading to the grasses. The other two changes involve small expansions of the IR/SSC boundaries. The first IR expansion, on the IRb/SSC boundary, has duplicated 29 bp of ndhF, and this change is restricted to members of the subfamily Panicoideae. The second IR expansion occurs on the IRa/SSC boundary and has resulted in duplication of 173–209 bp of ndhH. This structural change provides further support for the sister relationship between the subfamilies Ehrhartoideae and Pooideae. The ycf2 gene has retained remnant fragments of various sizes, ranging from 698 to 2,089 bp (Table 2). The distribution of the sizes of these remnants is congruent with the phylogenetic tree; the largest fragments are in the early diverging Panicoideae lineage, and smaller fragments are present in the subfamilies Ehrhartoideae and Pooideae (Fig. 4).

Evolutionary Rate Comparisons

Values of dN/dS, dN, and dS were compared within grasses and between grasses and other angiosperm plastid genomes. Individual gene trees indicate a rapid acceleration of nucleotide substitutions for the branch leading to the grasses (Fig. 5a–f). Wilcoxon rank sum tests showed that, over all gene types, values of dN and dS (both P < 0.0001) are significantly different for the branch leading to the grasses relative to all other branches (Table 3). However, values of dN/dS over all gene types are not significantly different (P = 0.0984). The phylogenetic trees in Fig. 5a–f illustrate the degree of substitution variability for the branch leading to the grasses relative to other angiosperm branches. Both dN and dS are highly accelerated in the gene rpl32 (a, b), dN is high for the gene rps11 (c, d), and dS is relatively high for the gene psbJ (e, f).
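
For readers unfamiliar with the Wilcoxon rank sum test, the following Python sketch shows how two sets of per-gene rate estimates can be compared. It is not the R code used in the study. It relies on the normal approximation with a tie correction and no continuity correction, so its P-values may differ slightly from those of an exact test. The function name and the input values are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns (U, P), where U is the Mann-Whitney statistic for sample x.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share the mean rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = w - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every value identical, so there is no evidence
        return u, 1.0
    z = (u - mean) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical per-gene dS values for two sets of branches.
grass_stem = [0.9, 1.1, 1.3, 1.5, 1.7]
other_branches = [0.1, 0.2, 0.3, 0.4, 0.5]
u_stat, p_value = rank_sum_test(grass_stem, other_branches)
print(u_stat, p_value)
```

Because the test uses ranks rather than raw values, it makes no assumption that substitution rates are normally distributed across genes.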

Fig. 5 Sample trees from codeml analyses showing rate acceleration (dN or dS) for three plastid genes. a, b Large subunit ribosomal protein L32. c, d Small subunit ribosomal protein S11. e, f Photosystem II protein J. The Poaceae clade is shaded

Table 3 Branch comparisons of dN/dS, dN, and dS for gene groups

To better understand rates and patterns of nucleotide substitutions, average dN/dS, dN, and dS values per gene were plotted across the length of the plastid genome using the grass gene order (Fig. 6). Values for the branch leading to Poaceae, internal Poaceae branches, non-Poaceae monocot branches, and other angiosperm branches were compared. In general, substitution rates for the branch leading to grasses (shown as circles) are high relative to rates for all other branches. Although dN/dS and dN are highly variable across the genome, dS is broadly accelerated relative to values from other branches. For both dN and dS, values for the internal Poaceae branches (shown as “x”s) are lower across the genome relative to other branches. Aside from the branch leading to Poaceae, rates of sequence evolution in the IR region are low, a phenomenon previously described by Wolfe et al. (1987). Notably, for the branch leading to the grasses, the genes cemA and rps7 exhibit dN/dS ratios greater than 1 (1.66 and 1.44, respectively); however, the raw values of dN and dS are not out of line. Likelihood ratio tests were used to test the fit of a null model to two alternative models (Supplementary Table S3). Significant improvement in likelihood scores was found for a number of genes, notably ATPase, ribosomal protein, and RNA polymerase genes.

Fig. 6 Average dN/dS, dN, and dS values per gene plotted across the length of the plastid genome using the grass gene order. Values for the branch leading to Poaceae (circles), internal Poaceae branches (“x”s), non-Poaceae monocot branches (triangles), and other angiosperm branches (squares) were compared. For values of dN/dS, black squares show both non-Poaceae monocot and other angiosperm branches due to PAML model parameters. Note that the scales are different for dN/dS, dN, and dS plots

Modest rate heterogeneity among gene types in angiosperm plastid genomes has been previously described (Gaut et al. 1993; Logacheva et al. 2007), and we used Wilcoxon rank sum tests to compare rates among branches for all genes and for genes encoding subunits of the photosynthetic apparatus, genes involved in gene expression, and genes involved in metabolism (Table 3). For the ratio dN/dS, the branch leading to Poaceae is elevated for subunits of the photosynthetic apparatus relative to all other branches, internal Poaceae, non-Poaceae monocot, and other angiosperm branches (all P = 0.0001). Values of dN/dS for genes involved in gene expression are significantly different in all comparisons; the result “na” is due to model parameters in PAML analyses. For subunits of the photosynthetic apparatus, dN is significantly different for the internal Poaceae branches relative to non-Poaceae monocots and other angiosperm branches (P < 0.0001). Values of dN are highly variable for genes involved in gene expression and metabolism (Table 3), and P-values are less than 0.0001 in all but two comparisons. A similar trend was found for values of dS for all gene groups; values of dS are significantly different (P < 0.0001) in all but four comparisons (Table 3). For all gene types, rates of sequence evolution are not significantly different between the non-Poaceae monocots and other angiosperm branches, and this likely reflects the relative extent of rate homogeneity among the majority of angiosperm plastid genomes.

The degree of rate acceleration for the branch leading to Poaceae for individual genes and gene groups was estimated (Tables 4, 5, 6). These results show that the branch leading to Poaceae is not significantly elevated for values of dN/dS (Table 4). However, values of dN for the branch leading to Poaceae are highly accelerated relative to internal Poaceae branches (P < 0.001) and moderately accelerated relative to the non-Poaceae monocot branches (P < 0.0056) (Table 5). The greatest degree of dN increase was found for the individual genes clpP and cemA, but ATPase (atp) and ribosomal protein genes (both large and small subunit; rpl and rps, respectively) are high relative to photosynthetic genes (psa, psb, and pet) (Table 5). In terms of values of dS, the branch leading to Poaceae is significantly accelerated relative to all other branches, and the degree of increase is consistent among genes and gene types (Table 6). For values of both dN and dS, the internal Poaceae branches indicate a strong degree of sequence conservation, and the branch leading to Poaceae is evolving on average 10- to 20-fold faster (P < 0.001) than internal Poaceae branches.

Table 4 The degree of rate acceleration for the ratio of nonsynonymous to synonymous substitutions (dN/dS) on the branch leading to Poaceae relative to other branches in the phylogeny
Table 5 The degree of rate acceleration for nonsynonymous substitutions (dN) on the branch leading to Poaceae relative to other branches in the phylogeny
Table 6 The degree of rate acceleration for synonymous substitutions (dS) on the branch leading to Poaceae relative to other branches in the phylogeny

Discussion

Plastid Genome Organization and Evolution

Our survey of nine Poaceae plastid genomes and the sequence of T. latifolia (Typhaceae) shows that genome organization and rates of sequence evolution are unusual in the Poaceae. Our analyses included the earliest diverging Poales lineage (Typhaceae), but the closest relatives of the Poaceae have not been sequenced. Doyle et al. (1992) surveyed for the distribution of inversions among Poales, and showed that changes were not confined to the Poaceae. Likewise, gene and intron losses may have a broader distribution among Poales, and more data are needed to fully characterize Poales genome evolution. Nonetheless, Poaceae plastid genomes have experienced genomic change relative to T. latifolia and most other angiosperms. Based on our data, gene content in Poales plastid genomes is identical (Table 1) except for the loss of three genes (accD, ycf1, and ycf2) in the Poaceae. The Festuca arundinacea plastid genome sequence on GenBank apparently also lacks intact copies of the genes psbF, rps14, rps18, and ycf4, but these surprising gene losses should be confirmed. Although other differences can be found among the annotations of the published grass genomes, all are due to annotation errors, both for protein-coding genes and tRNA genes. The recently published Brachypodium genome reported 136 genes (Bortiri et al. 2008), but this included ycf68 that has been shown to be non-functional (Raubeson et al. 2007). Three other recently published Poaceae plastid genomes (Saski et al. 2007) also incorrectly identified 32 tRNAs instead of the 30 found in the other sequenced genomes.

Organization and evolution of Poaceae plastid genomes have been examined extensively in early studies using restriction site and gene mapping approaches (Bowman and Dyer 1986; Howe 1985; Howe et al. 1988; Katayama and Ogihara 1993; Prombona and Subramanian 1989; Quigley and Weil 1985; Shimada and Sugiura 1989) and later based on complete genome sequences (Asano et al. 2004; Bortiri et al. 2008; Hiratsuka et al. 1989; Maier et al. 1995; Ogihara et al. 2000; Saski et al. 2007; Wu et al. 2009). These comparisons identified a number of unusual features, including the presence of three inversions in the LSC, the loss of introns from two genes (clpP and rpoC1), the loss of three genes (accD, ycf1, and ycf2), and expansions of the IR/SC boundaries to duplicate trnH-gug and rps19 on the IR/LSC boundary and rps15 on the IR/SSC boundary. Furthermore, the phylogenetic distribution of some of these rearrangements has been examined by combining genome sequencing and PCR-based surveys (Downie et al. 1996; Doyle et al. 1992; Wang et al. 2008). Our comprehensive comparisons of the complete plastid genomes of nine Poaceae and T. latifolia confirm the presence of all of these genomic rearrangements. We will briefly review the conclusions from the above studies, and then highlight the novel aspects resulting from our comprehensive comparisons of sequences of nine published genomes and the related T. latifolia genome.

Among monocots most of the rearrangements identified in Poaceae plastid genomes appear to be restricted to this family, including the loss of introns from two genes, three gene losses, and the smallest of the three genome inversions. In most cases, losses of these same genes and introns have occurred independently elsewhere in angiosperms (Jansen et al. 2007), including four losses of accD, one loss of ycf1, two losses of both clpP introns, and numerous rpoC1 intron losses. In a survey of inversions for 12 of 16 families of Poales, the first and largest 28 kb inversion is shared by the two closely related families Joinvilleaceae and Restionaceae, and the second 6 kb inversion is present in the Joinvilleaceae, supporting its placement as sister to Poaceae (Doyle et al. 1992). Expansions and contractions of the IR have been documented throughout angiosperm plastid genomes (Goulding et al. 1996; Wang et al. 2008). The expansion of the IR/LSC boundary in grasses to duplicate trnH-gug is characteristic of all monocots and some early diverging eudicots, but further expansion to include a complete duplication of rps19 is restricted to a more derived clade of monocots including Asparagales, Commelinales, Zingiberales, Arecales, and Poales (Wang et al. 2008). Our comparisons of nine plastid genomes of Poaceae confirm the expansion of the IR at the IR/LSC boundary resulting in the duplication of both trnH-gug and rps19 and demonstrate that the endpoint of the IR is highly conserved with only 35–99 bp duplicated beyond rps19 (Fig. 1).

Expansion of the IR into the SSC region is much less common, and only a few angiosperm families exhibit this phenomenon, including Campanulaceae (Cosner et al. 1997; Haberle et al. 2008; Knox and Palmer 1999), Geraniaceae (Chumley et al. 2006; Palmer et al. 1987), and Polygonaceae (Aii et al. 1997; Logacheva et al. 2008). Earlier investigations of one or two plastid genomes of Poaceae (Hiratsuka et al. 1989; Maier et al. 1995; Prombona and Subramanian 1989) identified expansion of the IR into the SSC region resulting in the duplication of rps15. Maier et al. (1995) compared rice and maize and examined variation in the extent of expansion of the IR/SSC boundary beyond rps15. In maize, there was an additional expansion of IRb to duplicate 29 bp of ndhF, whereas in rice IRa expanded to duplicate 216 bp of ndhH. Our comparison of the IR/SSC boundaries among the nine sequenced Poaceae plastid genomes (Fig. 1) demonstrates that the pattern of expansion is congruent with phylogenetic relationships, with the IRb expansion restricted to the Panicoideae and the IRa expansion shared by the subfamilies Ehrhartoideae and Pooideae. This structural feature supports the sister relationship between the latter two subfamilies, which is congruent with the tree based on nucleotide sequences (Fig. 4).

Based on comparisons of rice and maize plastid genome sequences, Maier et al. (1995) suggested that accD has been completely lost but that ycf2 represents different stages of gene deletion. This conclusion was supported by the fact that neither species had any residual sequence left for accD, whereas ycf2 has different sized residual fragments in rice and maize. Our comparisons of the sequences of all three missing genes confirm that accD has been almost completely lost from all grasses, although three genera, Hordeum, Lolium, and Oryza, retain small, highly divergent remnant sequences (Table 2). The distribution of residual accD sequence does not correlate with phylogenetic relationships (Fig. 4). The more extensive sampling reported here reveals a more informative pattern for ycf2. For this gene, there are three distinct size classes of remnant sequences (698–700, 1,314–1,413, and 2,061–2,089 bp), and these sizes correlate with phylogenetic relationships among grasses. The largest remnant occurs in the earliest diverging panicoid clade, suggesting that there has been a progressive degeneration of ycf2 within grasses. The loss of ycf1 shows yet another pattern, in which all nine Poaceae maintain a similar sized remnant sequence of the gene. In the case of both ycf1 and ycf2, the larger amount of residual sequence for these genes could be attributed to their presence in the IR, which is known to be more highly conserved than single-copy regions (Wolfe et al. 1987). A similar argument was made for the high level of sequence conservation of ycf15 and ycf68 (Raubeson et al. 2007). We do not know the extent of gene and intron loss on Poales lineages leading to the Poaceae, and data gathered through future sequencing projects will certainly shed light on rates and mechanisms of gene and intron loss in plastid genomes.
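The three ycf2 remnant size classes described above can be made concrete with a small sketch that bins remnant lengths into those ranges. This is only an illustration: the class names ("small", "medium", "large"), the function names, and the taxon labels and lengths in the example are hypothetical and are not taken from Table 2; only the three bp ranges come from the text.

```python
# Size classes (bp) reported in the text for ycf2 remnant sequences.
# The class names are illustrative labels, not terms used in the study.
YCF2_SIZE_CLASSES = {
    "small": (698, 700),
    "medium": (1314, 1413),
    "large": (2061, 2089),
}


def size_class(length_bp):
    """Return the name of the size class containing length_bp, or None."""
    for name, (low, high) in YCF2_SIZE_CLASSES.items():
        if low <= length_bp <= high:
            return name
    return None


def group_by_class(remnants):
    """Group taxon labels by the size class of their remnant length."""
    groups = {}
    for label, length_bp in remnants.items():
        groups.setdefault(size_class(length_bp), []).append(label)
    return groups


# Hypothetical remnant lengths, for illustration only.
example = {"taxon_A": 699, "taxon_B": 1350, "taxon_C": 2075, "taxon_D": 900}
groups = group_by_class(example)
```

Mapping each group onto the tree in Fig. 4 is then a matter of checking whether the taxa in one size class form a clade; lengths that fall outside all three ranges are grouped under `None` rather than forced into a class.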

Rates and Patterns of Sequence Evolution in Grass Plastid Genomes

In the current study, we characterize rates and patterns of sequence evolution for angiosperm plastid genomes, and we specifically test the degree and nature of rate acceleration for the branch leading to Poaceae. Our results are consistent with early models of plastid genome evolution; rates of both dN and dS vary across lineages, rates of dS are relatively homogeneous across loci, and rates of dN vary across loci (Muse and Gaut 1994). However, the degree of rate heterogeneity for the branch leading to Poaceae is highly unusual. Aside from a recent study demonstrating extreme rate heterogeneity in Geraniaceae genome sequences (Guisinger et al. 2008), accelerated rates of nucleotide substitutions are typically not found in photosynthetic angiosperm plastid genomes. Results from the current study indicate a high degree of positive or relaxed selection on the branch leading to Poaceae, and the genes cemA and rps7 exhibit dN/dS ratios greater than 1 (1.66 and 1.44, respectively). This ratio is often used as a measure of selective pressure, with dN/dS = 1, >1, and <1 indicating neutral evolution, positive selection, and purifying selection, respectively (Yang 1998). Additional analyses are needed to determine amino acid sites that may be under positive selection, and it should be noted that the models used in our analyses do not allow for heterogeneous dN/dS ratios among sites (Yang and Nielsen 2002; Yang et al. 2000). The results from likelihood ratio tests (Supplemental Table S3) indicate that a number of genes are accumulating nonsynonymous mutations at a significantly elevated rate, suggesting that either positive or relaxed selection is acting on these genes. The majority of these are ATPase, ribosomal protein, and RNA polymerase genes.
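The interpretation of dN/dS and the logic of a likelihood ratio test can be illustrated with a minimal sketch. It assumes a comparison of two nested models that differ by a single free parameter (for example, one dN/dS ratio for the whole tree versus a separate ratio for one branch), so that the test statistic is compared with a chi-square distribution with one degree of freedom; the tests reported in Supplemental Table S3 may use different models or degrees of freedom. The function names, the tolerance around 1, and the log-likelihood values are hypothetical.

```python
import math


def classify_omega(omega, tol=0.05):
    """Label a dN/dS ratio using the conventional thresholds.

    The tolerance around 1 is an arbitrary illustrative choice; a point
    estimate alone does not establish selection without a formal test.
    """
    if omega < 0:
        raise ValueError("dN/dS cannot be negative")
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "positive" if omega > 1.0 else "purifying"


def lrt_df1(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns the statistic 2 * (lnL_alt - lnL_null) and its p-value under
    a chi-square distribution with 1 degree of freedom. For 1 df the
    survival function equals erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# dN/dS values reported in the text for cemA and rps7.
labels = {gene: classify_omega(w) for gene, w in {"cemA": 1.66, "rps7": 1.44}.items()}

# Hypothetical log-likelihoods, for illustration only.
stat, p_value = lrt_df1(lnl_null=-1000.0, lnl_alt=-995.0)
```

A ratio above 1 on its own is compatible with either positive or relaxed selection, which is why the text treats the two together; the likelihood ratio test only asks whether allowing a separate ratio on the branch of interest fits significantly better than the null model.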

In addition to better characterizing rates of sequence evolution for the branch leading to the grasses, our results are generally consistent with other findings regarding grass plastid genome evolution. The individual genes clpP, cemA, and rpl32 seem to be evolving rapidly, photosynthetic genes are evolving more slowly than ribosomal protein genes and appear to be under stronger purifying selection, dN varies across loci, dS is uniform across loci, and substitution rates are accelerated for the grasses relative to other angiosperms (Chang et al. 2006; Matsuoka et al. 2002; Muse and Gaut 1997). Although Chang et al. (2006) found that values of dS were not significantly different for the grasses relative to one other monocot (the orchid Phalaenopsis), we show that dS is significantly different between grass and other monocot branches. We chose to exclude the genome sequence of Phalaenopsis (Chang et al. 2006) from our analyses because all 11 ndh genes have been lost. Furthermore, we were able to include plastid genome sequences from seven more monocots than were included in that study. In our analyses, the branch leading to the Poaceae and internal Poaceae branches are compared separately, and only the branch leading to the Poaceae exhibits a significant amount of rate acceleration. Moreover, internal Poaceae branches are evolving at a slower rate than other branches in the phylogeny (Figs. 5, 6). It appears that after a rapid burst in sequence evolution, rates decelerated in grass plastid genomes, and this deceleration occurred subsequent to grass diversification. Using increased taxon sampling and methods that detect the degree and nature of rate change on specific branches in a phylogeny, we are able to better characterize grass plastid genome sequence evolution.

Factors affecting rates of sequence evolution in plastid genomes have been extensively examined. Speciation rates (Barraclough et al. 1996; Bousquet et al. 1992), generation time (Chang et al. 2006; Smith and Donoghue 2008), substitution and codon bias (Morton 2003; Morton and Clegg 1995), gene function (Matsuoka et al. 2002), gene copy number (genes duplicated in the IR evolve more slowly than single-copy genes; Wolfe et al. 1987), and genome copy number (Khakhlova and Bock 2006) have been shown to influence substitution rates. In addition, a recent phylogenetic analysis of 81 plastid genes from 64 seed plants described a positive correlation between genomic rearrangements and lineage-specific rate acceleration (Jansen et al. 2007). Furthermore, the highly rearranged plastid genomes of the plant family Geraniaceae exhibit the greatest degree of rate acceleration among photosynthetic angiosperms (Guisinger et al. 2008). As shown in Fig. 4, there are eight major structural changes on the branch leading to grasses, including the loss of introns in two genes, three gene losses, and three inversions. We hypothesize that rates of sequence evolution may be correlated with genomic changes in grass plastid genomes. It should be noted that after the divergence of the grasses no major genomic changes occurred aside from minor expansions of the IR region, a very common process that accounts for size variation in plastid genomes throughout angiosperms (Aii et al. 1997; Goulding et al. 1996; Plunkett and Downie 2000), including the monocots (Wang et al. 2008).

A correlation between genomic changes and rates of sequence evolution has been previously described for bacterial (Belda et al. 2005) and animal mitochondrial genomes (Shao et al. 2003; Xu et al. 2006). Mechanisms have been proposed to explain this correlation, and it is possible that similar mechanisms are responsible for the unusual evolution of grass plastid genomes. One mechanism involves homologs of the eubacterial gene recA. In E. coli, this gene is responsible for DNA repair during homologous recombination and strand exchange (Lin et al. 2006). Homologs of recA are found in plant and algal nuclear genomes (Lin et al. 2006), and gene products are localized to plastids and mitochondria in Arabidopsis (Cao et al. 1997; Cerutti et al. 1992; Khazi et al. 2003). It is possible that genomic changes and accelerated rates of sequence evolution are the result of mutations in plastid-targeted rec-genes, although their presence and function in plastid genomes have not been thoroughly tested.

Conclusion

In agreement with earlier studies using large data sets of morphological, anatomical, and single plastid gene sequence characters (Barker et al. 1995; Clark et al. 1995; Kellogg and Watson 1993), our study suggests that the grasses have experienced rapid molecular diversification relative to other monocots and to early diverging members of the Poales, i.e., T. latifolia. This point was made well by Chase (2004), who noted that there is a pattern of small, “insignificant” sets of taxa sister to Poaceae. Graham et al. (2006) performed a phylogenetic analysis of 17 plastid protein-coding genes and included taxa from Poaceae, Typhaceae, and eight additional Poales families. Branch lengths for most members of the Poales were long except for the three earliest diverging families, including Typhaceae. These data suggest that genomic changes and accelerated rates of sequence evolution may not be limited to the Poaceae, and that a positive correlation between these two phenomena may also hold for lineages leading to the Poaceae. We emphasize that additional Poales genome sequences are needed to fully understand the evolution of Poales and Poaceae plastid genomes. Nonetheless, we show the extent to which plastid genomes on the branch leading to the Poaceae experienced rapid rates of genomic change and sequence evolution. We also show that the rates of plastid genome evolution for internal Poaceae branches have decelerated. Whatever the cause of rapid change in the branch leading to the grasses, subsequent deceleration indicates that the factors responsible may no longer be driving genome evolution in this family.