EST sequencing and phylogenetic analysis of the model grass Brachypodium distachyon
- Cite this article as:
- Vogel, J.P., Gu, Y.Q., Twigg, P. et al. Theor Appl Genet (2006) 113: 186. doi:10.1007/s00122-006-0285-3
Brachypodium distachyon (Brachypodium) is a temperate grass with the physical and genomic attributes necessary for a model system (small size, rapid generation time, self-fertility, small genome size, diploidy in some accessions). To increase the utility of Brachypodium as a model grass, we sequenced 20,440 expressed sequence tags (ESTs) from five cDNA libraries made from leaves, stems plus leaf sheaths, roots, callus and developing seed heads. The ESTs had an average trimmed length of 650 bp. Blast nucleotide alignments against SwissProt and GenBank non-redundant databases were performed and a total of 99.9% of the ESTs were found to have some similarity to existing protein or nucleotide sequences. Tentative functional classification of 77% of the sequences was possible by association with gene ontology or clusters of orthologous groups index descriptors. To demonstrate the utility of this EST collection for studying cell wall composition, we identified homologs for the genes involved in the biosynthesis of lignin subunits. A subset of the ESTs was used for phylogenetic analysis that reinforced the close relationship of Brachypodium to wheat and barley.
The application of model systems toward the study of both basic and applied problems in plant biology has become routine. Researchers have employed the model dicot Arabidopsis thaliana to study topics ranging from nutrient uptake and metabolism to plant–pathogen interactions. Unfortunately, due to its distant relationship to monocots, Arabidopsis is not suitable to study biological features unique to the grasses (e.g. cell wall composition). With its sequenced genome and established research community, rice can serve as a model grass for some applications. Unfortunately, as a specialized semi-aquatic tropical grass that diverged from the most important forage grasses and temperate grains approximately 50 million years ago (Gaut 2002), rice is not a good model for temperate grasses. Rice also has physical and physiological attributes that limit its utility as a model system. Its large size and long generation time make experiments involving growing large numbers of plants under controlled conditions very expensive. It is also challenging to grow rice under the conditions typically present in greenhouses in northern climates.
Brachypodium distachyon (Brachypodium) is a small temperate grass with all the attributes needed to be a modern model organism including simple growth requirements, fast generation time, small genome and self-fertility (Draper et al. 2001). Brachypodium is also readily transformed by Agrobacterium or biolistics facilitating many biotechnological applications (Christiansen et al. 2005; Vogel et al. 2006). The haploid genome size of diploid Brachypodium is approximately 0.36 pg, slightly over twice the size of Arabidopsis (Bennett and Leitch 2005; Vogel et al. 2006). Thus, Brachypodium possesses one of the smallest genomes of any grass and is suitable for both functional and structural genomic research.
Brachypodium belongs to the subfamily Pooideae and diverged just prior to the radiation of the small grain crops and forage and turf grasses, making it “sister” to the temperate grasses of greatest economic significance (Kellogg 2001). This placement is based on limited sequence data from internal transcribed spacers (ITS), the 5.8S subunit of nuclear ribosomal DNA and the chloroplast ndhF gene (Catalán and Olmstead 2000; Hsiao et al. 1994). RFLP and RAPD markers also indicate a similar placement of Brachypodium (Catalán et al. 1995; Shi et al. 1993). However, as was seen with rice, phylogenies based on different genes can produce phylogenetic trees with different topologies (reviewed in Kellogg 1998). Thus, it would be desirable to establish the relationship of Brachypodium to the temperate grasses using additional genes.
Brachypodium can also serve as a model to study polyploidy because polyploid and diploid accessions are available. An interesting observation is that accessions that are hexaploid by chromosome counts only have approximately twice as much DNA as diploids (Vogel et al. 2006). This may have arisen through the loss of DNA after the event leading to polyploidy through the process called diploidization (Wolfe 2001). Alternatively, the hexaploid accessions may have arisen from the combination of ancestors with chromosomes of different sizes.
Partially sequencing large numbers of random cDNA clones to generate a collection of expressed sequence tags (ESTs) is a fast and efficient way to provide a wealth of genomic information about a particular species (Adams et al. 1991). Such ESTs can be used for functional genomic experiments including the construction of cDNA microarrays and reverse genetics through gene silencing. ESTs can also serve as the raw material from which molecular markers suitable for mapping experiments can be made. Currently, wheat, corn, rice, barley, sorghum and sugarcane all have > 200,000 ESTs deposited in the dbEST division of GenBank. These ESTs have been used for many projects including: microarrays, analysis of gene expression patterns, generation of molecular markers, and physical mapping (including the assignment of genes to delineated regions in the wheat genome). Here we report the construction of five Brachypodium cDNA libraries, the sequencing of 20,440 ESTs, the identification of transcripts for the genes involved in lignin monomer biosynthesis and the construction of a phylogenetic tree for Brachypodium, rice, wheat, barley, corn, sorghum, sugarcane, Arabidopsis, soybean, tomato, poplar and pine based on 11,118 bp from 20 highly expressed nuclear genes.
Materials and methods
Plant material and growth conditions
Two lines of diploid B. distachyon derived from a single wild collection were used for this study. One line, Bd21, underwent five generations of single-seed descent and was used for the stem plus sheath, leaf, callus and root cDNA libraries (Vogel et al. 2006). The developing seed head cDNA library was constructed using another line, Bd21-0, that had not undergone single-seed descent. Plants for the stem plus sheath, leaf and developing seed head libraries were grown in a greenhouse as described (Vogel et al. 2006). Stems plus leaf sheaths, and leaves were harvested from the same plants shortly after anthesis. Entire developing seed heads containing seeds that ranged in maturity from anthesis to almost mature as well as the subtending bracts and stems were harvested and mixed together. Callus was initiated and grown as described (Vogel et al. 2006). A mixture of embryogenic and non-embryogenic callus was used.
To isolate clean roots, plants were grown hydroponically using a raft prepared from aluminum foil and a ring of styrofoam. The raft was placed on the surface of water containing 1 g/l of Peter’s 20-20-20 fertilizer (Scotts, Marysville, OH, USA). The fertilizer mix was contained in a 14 cm deep reservoir. Exposed water was covered with aluminum foil to prevent algal growth. Seeds were placed on paper towels and incubated at 4°C for 7 days to synchronize germination. The seeds were then placed at 22°C for 2 days to promote germination. Germinated seeds were inserted into small holes in the aluminum foil such that the emerging root was submerged in the fertilizer solution. The hydroponic apparatus was placed in a growth chamber under continuous illumination at 24°C. Roots were harvested after 3 weeks. All plant material used for library construction was flash frozen in liquid nitrogen prior to RNA extraction.
Total RNA was extracted from 4 g aliquots of stems plus sheaths, leaves, roots, developing seed heads, and callus tissue by first grinding in liquid nitrogen. The resulting powder was scraped into an RNase-free 50 ml tube and processed using Plant RNA Reagent (Invitrogen, Carlsbad, CA, USA). Poly A+ RNA was extracted from the total RNA samples using the FastTrackMAG Maxi mRNA isolation kit (Invitrogen, Carlsbad, CA, USA). The yield and integrity of total and poly A+ RNA were assessed by agarose gel electrophoresis and UV spectrophotometry at 260 nm (Beckman DU-640, Fullerton, CA, USA).
Library construction was performed using the Superscript II system for cDNA synthesis and cloning (Invitrogen, Carlsbad, CA, USA) with 5 μg of poly A+ RNA as starting material. First strand synthesis was performed using the NotI adapter-primer to add a NotI site on the downstream portion of the cDNA. Following second strand synthesis, the resulting cDNA was ligated to SalI adapters, cut with NotI, size selected, and ligated into the pSPORT1 plasmid vector prior to transformation into OmniMax 2 T1 phage-resistant chemically competent cells (Invitrogen, Carlsbad, CA, USA). The resulting transformants were selected on LB agar plates containing 100 μg/ml ampicillin. Plasmid minipreps and sequencing using an M13 reverse sequencing primer were carried out as described (Tobias et al. 2005).
Raw sequence files were processed using the Phred base-calling program (Ewing and Green 1998; Ewing et al. 1998). Phred also trimmed the sequences based on data quality using a probability cutoff value of 0.05 (an error probability of 0.05 corresponds to a Phred score of approximately 13) to retain only the high quality segment of the sequence. The trimmed sequences were further processed to mask the ends of reads that contained vector and adapter sequence using the program Crossmatch (http://www.phrap.org). Masked sequences were then removed from the sequence and quality files using an in-house Perl script (Lazo et al. 2004). Sequences less than 100 bp in length after processing were excluded from analysis. Sequence tracefiles and quality scores are available at the project website (http://wheat.pw.usda.gov/bEST).
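The relationship between a Phred score Q and a base-call error probability P is Q = -10 log10(P). The following is a minimal Python sketch of that conversion, of a maximum-scoring-segment trim at a given probability cutoff, and of the 100 bp length filter. It is an illustration only and not the Phred program or the in-house Perl script; the function names are hypothetical, and the trimming routine is a simplified stand-in that assumes each base contributes (cutoff minus its error probability) to the segment score.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score implied by an error probability p."""
    return -10 * math.log10(p)


def trim_by_quality(quals, cutoff=0.05):
    """Return (start, end) of the segment maximizing sum(cutoff - p_error).

    Simplified illustration of quality trimming (an assumption, not
    Phred's own code). Returns (0, 0) if no base beats the cutoff.
    """
    best_sum, best = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += cutoff - phred_to_error_prob(q)
        if run_sum <= 0:
            # A non-positive running score can never help a later segment.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


def keep_read(trimmed_seq, min_len=100):
    """Length filter from the text: reads shorter than 100 bp are excluded."""
    return len(trimmed_seq) >= min_len
```

For example, `error_prob_to_phred(0.05)` evaluates to about 13.0, and `phred_to_error_prob(20)` to 0.01.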
The BlastX algorithm was used to compare ESTs to the EBI UniProt database (Release 4.2). Gene ontology (GO) terms were assigned to individual ESTs by cross referencing the UniProt hits to GO terms using the association tables found on the Gene Ontology Consortium site (http://www.geneontology.org). ESTs without GO associations were compared to the NCBI non-redundant database (Release 144) using BlastX and BlastN. The matches to the non-redundant database were then matched to descriptions from the NCBI clusters of orthologous groups (COG) Index (http://www.ncbi.nlm.nih.gov/COG/). ESTs without GO terms or non-redundant matches were compared to ESTs contained in the dbEST database using BlastN. The Phrap assembly program was used to assemble the ESTs into contigs. The Phrap parameters used (penalty −5; minmatch 50; minscore 100) resulted in EST clusters of > 90% identity over a 100 bp window.
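The cross-referencing step can be pictured as a simple table join: each EST inherits the GO terms associated with its best UniProt hit, and ESTs left without terms pass on to the next search. The sketch below assumes the association table has already been parsed into (accession, GO ID) pairs and that the best hit per EST is known; the function names, EST names and accessions are hypothetical placeholders, not data from this study.

```python
from collections import defaultdict


def load_associations(rows):
    """Build {protein accession: set of GO IDs} from (accession, go_id) pairs."""
    acc_to_go = defaultdict(set)
    for acc, go_id in rows:
        acc_to_go[acc].add(go_id)
    return acc_to_go


def assign_go(est_best_hit, acc_to_go):
    """Map each EST to the GO terms of its best hit.

    est_best_hit: {est_id: accession or None when there was no hit}.
    Returns (annotated, unannotated); the second list holds ESTs that
    would go on to the non-redundant database search.
    """
    annotated, unannotated = {}, []
    for est, acc in est_best_hit.items():
        terms = acc_to_go.get(acc)  # .get() avoids creating empty entries
        if terms:
            annotated[est] = sorted(terms)
        else:
            unannotated.append(est)
    return annotated, unannotated


# Placeholder identifiers for illustration only.
associations = load_associations([
    ("ACC_A", "GO:0000001"),
    ("ACC_A", "GO:0000002"),
    ("ACC_B", "GO:0000003"),
])
hits = {"est_1": "ACC_A", "est_2": "ACC_C", "est_3": None}
annotated, unannotated = assign_go(hits, associations)
```

Keeping the unannotated ESTs in a separate list mirrors the tiered search described above, in which only ESTs without GO associations are compared to the next database.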
Identification of genes involved in lignin biosynthesis
Homologs for ten genes involved in the synthesis of monolignols (PAL, phenylalanine ammonia-lyase; C4H, cinnamate 4-hydroxylase; 4CL, 4-(hydroxy)cinnamoyl CoA ligase; CST, hydroxycinnamoyl CoA:shikimate hydroxycinnamoyltransferase; C3H, p-coumarate 3-hydroxylase; CCoAOMT, caffeoyl CoA O-methyltransferase; CCR, cinnamoyl CoA reductase; F5H, ferulate 5-hydroxylase; COMT, caffeic acid/5-hydroxyferulic acid O-methyltransferase; CAD, cinnamyl alcohol dehydrogenase) (reviewed in Humphreys and Chapple 2002) were identified through BLAST searches of the Brachypodium ESTs using known sequences for each gene. Brachypodium ESTs with highly significant BLAST scores were then used to query the GenBank non-redundant database. ESTs that hit the correct lignin biosynthetic gene were then compared to a list of ESTs contained in Brachypodium contigs to determine the number of tentatively unique genes.
Brachypodium contigs containing the most ESTs were selected as candidates for highly expressed genes to be used for phylogenetic analysis. Several steps were then used to select 20 suitable genes from which to construct phylogenetic trees. First, candidate contigs were compared to the NCBI non-redundant database using the BlastN algorithm to select contigs with highly significant hits to known genes. In cases where multiple contigs matched the same gene only one contig was selected for further analysis. Contigs corresponding to known genes were then used to retrieve ESTs from barley (Hordeum vulgare), rice (Oryza sativa), sugarcane (Saccharum officinarum), sorghum (Sorghum bicolor), wheat (Triticum aestivum) and corn (Zea mays) using BlastN with organism limits against dbEST. Four dicots, A. thaliana, soybean (Glycine max), tomato (Lycopersicon esculentum), poplar (Populus trichocarpa or P. trichocarpa × deltoides) and the gymnosperm pine (Pinus taeda) were included in the analysis to provide a buffer against artifacts in the grass lineage stemming from systematic bias. Pine also serves as an outgroup. The tBlastX algorithm was used to identify ESTs from soybean, tomato, poplar and pine that were the most similar to the Brachypodium genes. For three genes we could not identify suitable P. trichocarpa ESTs and used ESTs derived from the hybrid P. trichocarpa × deltoides. To identify the corresponding genes from Arabidopsis, the contigs were compared to the TAIR database of Arabidopsis proteins (http://www.arabidopsis.org/Blast/) using BlastX. For each individual contig, DNA sequences from the top four scoring ESTs for each of the six grasses, the top three scoring ESTs from soybean, tomato, poplar and pine and the coding sequence from Arabidopsis were aligned by the ClustalW algorithm using MegAlign software (DNAstar, Madison, WI, USA). 
A portion of the alignment that contained sequence information for all ESTs was selected to generate a phylogenetic tree using MegAlign. If ESTs from the same species were adjacent to one another in the cladogram, then the gene was selected for further study.
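The selection logic described above (top four ESTs per grass, top three per non-grass species, then a check that ESTs from one species sit together in the tree) can be sketched as follows. This is a hedged illustration: the input format, function names and scores are assumptions, and checking that each species forms one contiguous run in the leaf order is only a simplified proxy for adjacency in the cladogram, since leaf order depends on how a tree is drawn.

```python
from collections import defaultdict

GRASSES = {"barley", "rice", "sugarcane", "sorghum", "wheat", "corn"}


def top_hits(hits, n_grass=4, n_other=3):
    """Keep the best-scoring ESTs per species.

    hits: iterable of (species, est_id, score) tuples, e.g. parsed from
    BLAST output (parsing is not shown here).
    """
    by_species = defaultdict(list)
    for species, est_id, score in hits:
        by_species[species].append((score, est_id))
    selected = {}
    for species, scored in by_species.items():
        n = n_grass if species in GRASSES else n_other
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        selected[species] = [est_id for _, est_id in ranked[:n]]
    return selected


def species_adjacent(leaf_species):
    """True if every species occupies one contiguous run in the leaf order."""
    seen = set()
    previous = None
    for species in leaf_species:
        if species != previous:
            if species in seen:
                return False  # species reappears after another intervened
            seen.add(species)
            previous = species
    return True
```

A gene whose leaf order reads wheat, wheat, barley, barley would pass this check, whereas wheat, barley, wheat would not.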
The partial coding sequences (average 556 bp/gene) from the 20 highly expressed genes selected above were then used for phylogenetic analysis. The aligned sequences were combined to produce one sequence for each species that was used for phylogenetic analysis (supplemental data S1). The consensus sequence was 11,118 bp. Phylogenetic trees based on this alignment were constructed using seven programs: MegAlign, ClustalX (ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/; Thompson et al. 1997), and five programs (DNApars, DNAcomp, DNAml, DNApenny, DNAmlk) contained in the PHYLIP software package version 3.6 (Felsenstein 1989). Bootstrap analysis was conducted using the Seqboot (to create bootstrapped datasets) and Consense (to create a consensus tree) programs contained in the PHYLIP package along with the phylogeny analysis programs. Bootstrap analysis for ClustalX used the internal bootstrap feature.
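Two operations in this paragraph lend themselves to a short illustration: joining the per-gene alignments into one sequence per species, and creating bootstrapped datasets by resampling alignment columns with replacement, which is the general idea behind a program such as Seqboot. The Python sketch below is not the PHYLIP code and uses toy sequences; function names are hypothetical.

```python
import random


def concatenate(gene_alignments):
    """Join per-gene alignments into one sequence per species.

    gene_alignments: list of {species: aligned_sequence} dicts, one per
    gene, each covering the same species at equal aligned length.
    """
    species = sorted(gene_alignments[0])
    combined = {sp: "" for sp in species}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if set(aln) != set(species) or len(lengths) != 1:
            raise ValueError("each gene needs every species at equal aligned length")
        for sp in species:
            combined[sp] += aln[sp]
    return combined


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (same columns for all species)."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {sp: "".join(seq[c] for c in cols) for sp, seq in alignment.items()}


# Toy data: two "genes" for two species.
genes = [{"sp_a": "AC", "sp_b": "AG"}, {"sp_a": "T-", "sp_b": "TT"}]
combined = concatenate(genes)
replicate = bootstrap_replicate(combined, random.Random(0))
```

Each replicate has the same length as the original alignment; a tree is then built from every replicate and the trees are summarized in a consensus, with bootstrap support reported as the number of replicates recovering a given grouping.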
Library construction and sequencing
The cDNA libraries constructed were of high quality with approximately 400,000 clones per library. The average insert size based on restriction digests of 96 clones per library was 1.5 kb for all the libraries with the exception of the root library that had an average insert size of 1 kb (not shown).
Table: Brachypodium EST summary. Columns include: high quality ESTs; average length (bp). Rows include: developing seed head; stem plus sheath.
Of the 20,587 B. distachyon ESTs sequenced, 15,595 (76%) were assigned GO terms in any category (biological, cellular, molecular), 14,617 (71%) were assigned GO molecular terms, 13,770 (67%) were assigned GO biological terms, and 11,898 (58%) were assigned GO cellular terms (the complete list of GO matches is available at: http://wheat.pw.usda.gov/pubs/2006/Vogel/). Of the remaining 4,992 ESTs, 4,982 had matches to the NCBI non-redundant database; 6% of these were placed in functional categories as classified by the NCBI COG index and 94% were placed in unknown or genome categories (not shown). Of the ten ESTs that did not match the NCBI non-redundant database, three matched sequences in the dbEST database and the rest were highly repetitive.
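The percentages and remainders in this paragraph follow directly from the reported counts. As a quick arithmetic check, the short Python sketch below recomputes them using only the numbers given in the text; the variable names are illustrative.

```python
total_ests = 20_587
go_counts = {
    "any": 15_595,
    "molecular": 14_617,
    "biological": 13_770,
    "cellular": 11_898,
}

# Share of all ESTs in each GO category, rounded to whole percent.
go_percent = {name: round(100 * n / total_ests) for name, n in go_counts.items()}

# ESTs left without any GO term, and those still unmatched after the
# search against the non-redundant database.
without_go = total_ests - go_counts["any"]
nr_matches = 4_982
no_nr_match = without_go - nr_matches
```

Here `without_go` evaluates to 4,992 and `no_nr_match` to 10, matching the counts stated above.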
EST contig assembly and library comparisons
Lignin biosynthetic genes
Table: Representation of lignin biosynthetic genes in Brachypodium ESTs. Columns include: stem plus sheath; developing seed head; tentatively unique genes. Rows include: CST or CQT.
Table: Accession numbers of sequences used for phylogenetic analysis. Columns include: top BLAST hit. Entries include: chlorophyll a/b-binding protein CP26; heat shock protein 70; reversibly glycosylated polypeptide; cytosolic heat shock protein 90; putative chlorophyll a/b-binding protein type III; dnaK-type molecular chaperone hsp70; 23 kDa oxygen evolving protein of photosystem II.
In contrast to the robust placement of the grasses within the tree, the placement of the dicots was ambiguous. No two programs gave the same topology and the bootstrap values were very low ranging from 420 to 700 out of 1,000. This lack of resolution is due to the low nucleotide sequence conservation among the dicots. The percent identity among the dicots ranged from 77.4 to 79.6% whereas the percent identity among the grasses ranged from 86.5 to 96.2%. Given this uncertainty, the dicots are presented as unresolved (Fig. 2). Since the role of the dicot sequences was to buffer against artifacts in the grass clade their placement in the tree is not critical.
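Percent identity between two aligned sequences is the share of compared positions at which the bases agree. The sketch below is one common way to compute it, skipping any column in which either sequence has a gap; this gap treatment is an assumption and may differ from the definition used by MegAlign. Function names and the toy sequences are illustrative.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of gap-free aligned columns at which two sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = matches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


def identity_range(alignment):
    """(min, max) percent identity over all pairs in {name: aligned_seq}."""
    values = [
        percent_identity(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    ]
    return min(values), max(values)


# Toy alignment for illustration only.
toy = {"sp_a": "ACGT-A", "sp_b": "ACGA-A", "sp_c": "ACGTTA"}
low, high = identity_range(toy)
```

Applying `identity_range` separately to the grass sequences and to the dicot sequences would give ranges of the kind compared in the paragraph above.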
Temperate grasses are extremely important to humans because they supply a large percentage of our food either directly through the consumption of grains or indirectly through the consumption of grass-fed animals. Temperate grasses are also poised to play an increasing role in supplying energy due to the increasing economic, environmental and social expenses associated with petroleum-based fuels. The most important temperate grains and forage grasses are, for the most part, difficult experimental subjects in the laboratory. Taken together, these reasons point to the need for a model temperate grass that can be used to make rapid gains in our knowledge about the unique attributes of the temperate grasses. Brachypodium is well suited to serve as such a model grass. The EST sequences generated in this study greatly increase the Brachypodium sequence information available. Brachypodium now ranks ninth among the grasses in terms of ESTs contained in dbEST. While the modest size of our sequencing effort does not represent a comprehensive analysis of the expressed portion of the Brachypodium genome, it does provide enough information to begin functional genomic experiments and also provides the raw material for generating molecular markers.
We chose to sequence ESTs from cDNA libraries constructed from five different plant parts to maximize the number of genes represented in the EST collection. Our comparison of the libraries to one another supported the obvious biological differences between the materials used to make the libraries. For example, the developing seed head library had the highest percentage of library specific contigs which is likely due to the complex nature and specialized tissues contained in a seed head that are not found in the other libraries. As another example, the root library had the least overlap with the leaf and stem plus sheath libraries. This is not surprising given the photosynthetic nature of the leaf and stem.
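Library comparisons of this kind reduce to set operations once each contig is labeled with the libraries that contributed its ESTs. The sketch below shows one possible way to compute a library-specific fraction and the number of contigs shared by two libraries; the exact definitions used in this study are not given in the text, so the definitions, function names and toy data here are assumptions.

```python
def library_specific_fraction(contig_libs, library):
    """Fraction of contigs containing ESTs from `library` and from no other library.

    contig_libs: {contig_id: set of library names contributing ESTs}.
    """
    with_lib = [libs for libs in contig_libs.values() if library in libs]
    if not with_lib:
        return 0.0
    specific = sum(1 for libs in with_lib if libs == {library})
    return specific / len(with_lib)


def shared_contigs(contig_libs, lib_a, lib_b):
    """Number of contigs with ESTs from both libraries."""
    return sum(1 for libs in contig_libs.values() if lib_a in libs and lib_b in libs)


# Toy data for illustration only.
contigs = {
    "contig_1": {"leaf", "stem"},
    "contig_2": {"root"},
    "contig_3": {"seed_head"},
    "contig_4": {"seed_head"},
    "contig_5": {"seed_head", "leaf"},
    "contig_6": {"root", "stem"},
}
seed_fraction = library_specific_fraction(contigs, "seed_head")
root_leaf_shared = shared_contigs(contigs, "root", "leaf")
```

In this toy example two of the three contigs containing seed head ESTs are specific to that library, and the root and leaf libraries share no contigs.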
Due to the increasing economic, environmental and social costs associated with the consumption of fossil fuels, ethanol derived from ligno-cellulosic biomass has become an attractive alternative fuel source. To produce ethanol from biomass, the sugars locked in the cellulose and hemicellulose fraction of the cell wall are degraded to monosaccharides using a combination of physical, chemical and enzymatic treatments and then fermented into ethanol. The lignin fraction of the cell wall is not converted into ethanol and, in fact, interferes with the conversion. Lignin also decreases the digestibility of forage grasses. Thus, manipulation of the lignin biosynthetic pathway is an obvious target for biotechnological enhancement of forage grasses and grasses grown specifically for conversion into ethanol (Chen et al. 2003). We identified homologs to all ten genes currently thought to be required for the biosynthesis of lignin precursors (reviewed in Humphreys and Chapple 2002). Thus, we now have the starting pieces to rapidly evaluate different strategies for altering lignin content or composition in a temperate grass.
Our phylogenetic analysis based on 11,118 bp from 20 genes indicated that Brachypodium is much more closely related to wheat and barley than to corn, sugarcane or sorghum. This is consistent with previous reports (Kellogg 2001) based on much smaller data sets. That all seven computer programs used in our study arrived at the same conclusion with extremely high bootstrap values underscores the monophyletic nature of these two clades. In light of the strong support of the relationship among these grasses based upon the combined data set, it is interesting to note that some of the individual genes gave different phylogenetic trees. This underscores the importance of using large data sets from multiple genes to draw conclusions about phylogeny.
In contrast to the placement of the grasses, the position of the dicots within the phylogenetic trees was variable and not well supported by bootstrap analysis. In fact, none of the trees were identical to a phylogeny recently produced for the dicots (Judd and Olmstead 2004). The contrast between the robust clade containing the grasses and the unresolved relationship among the dicots is due to the greater sequence divergence observed among the dicots as compared to the divergence among the grasses. An analysis using protein rather than nucleotide sequence may have helped resolve the dicots, but that would have discarded information critical to resolving the highly related grasses. Since our goal was to define the relationship of Brachypodium within the grasses and dicot sequences were only added to act as a buffer against bias within the grasses, the lack of resolution in the dicot clade is acceptable.
We have developed a valuable resource for the emerging Brachypodium research community that can be used for functional genomic experiments, as a starting point for the development of molecular markers, and as anchor sequences for BAC-based physical maps. Our initial analysis of these ESTs confirmed the close relationship of Brachypodium to wheat and barley and identified homologs for all the genes required for lignin monomer biosynthesis.
Supported by the United States Department of Agriculture, Agricultural Research Service CRIS 5325-21000-013-00, NP307 Biofuel and Bioenergy Alternatives. This work was also supported in part by NIH Grant P20 RR16569 from the BRIN Program of the National Center for Research Resources, and a University of Nebraska at Kearney Research Services Council University Research & Creative Activity Grant.