Spruce proteome DB: a resource for conifer proteomics research
Lippert, D., Yuen, M. & Bohlmann, J. Tree Genetics & Genomes (2009) 5: 723. doi:10.1007/s11295-009-0220-2
Proteomics research is hampered in many organisms due to a lack of an appropriate reference genome sequence that can be used in the interpretation of tandem mass spectrometry data for the identification of proteins. Public DNA sequence repositories have grown to considerable size and can, in most cases, serve to provide at least partial interpretation of a large-scale proteomics dataset. However, when species-specific sequences or sequences from a closely related species are available, a boutique sequence database can provide considerable increases in specificity, confidence, and completeness of protein identification. Here, we describe the development of a protein database from a large-scale expressed sequence tag and full-length complementary DNA sequencing project in the economically and ecologically important spruce (Picea) genus.
In recent years, there has been growing appreciation of the need to apply systems biology approaches that go beyond the genome level to the study of plant science (Cui et al. 2008; Long et al. 2008; Nelson et al. 2008). This is due to the realization that the assignment of gene function and the understanding of dynamic molecular phenotypes depend greatly on the ability to measure changes occurring beyond the level of gene expression. One method of achieving this is through the measurement and characterization of the proteins being expressed within a biological system (i.e., proteomics). Proteomics represents a rapidly developing technical discipline that encompasses a wide range of activities such as the analysis of changing protein abundance (Bachi and Bonaldi 2008), posttranslational modifications (de la Fuente van Bentem et al. 2008), and functional protein interaction networks (Collura and Boissy 2007). However, proteomics methods continue to be underutilized in the area of plant biology outside of their application in well-defined model systems like Arabidopsis thaliana and Oryza sativa (Chen and Harmon 2006; Pan et al. 2005). Because proteomics is a rapidly developing discipline, the creation of new tools will be key to unlocking the value of these methods in other plant systems.
Proteomics research relies heavily on the use of tandem mass spectrometry, and an average dataset typically consists of tens to hundreds of thousands of individual mass spectra. By extension, proteomics research is critically dependent upon the availability of sequence databases for the rapid and unsupervised interpretation of these spectra to provide meaningful peptide sequence assignments and the associated protein identifications. Organisms for which the genome has not been sequenced have typically been at a disadvantage with respect to the practical application of proteomics methods. These organisms typically rely on searching against sequences from related species that share sequence identity with the organism under study. For species of spruce (Picea spp.) and other conifers, the most closely related genomes are all from evolutionarily distant angiosperm species (e.g., A. thaliana, rice, poplar, grapevine). However, it has been shown that distantly related sequences function poorly in the interpretation of proteomics data (Huang et al. 2006). The spruce proteome database (DB) described here was assembled from the sequence data produced during a large-scale expressed sequence tag (EST) and full-length complementary DNA (FLcDNA) sequencing project in spruce (Ralph et al. 2008) with representative sequences taken from Picea sitchensis (Sitka spruce), Picea glauca (white spruce), and Picea glauca × engelmannii (interior spruce). Spruce proteome DB is an expansion of the databases used in prior proteomics studies performed in these conifer species (Lippert et al. 2005, 2007, 2009) and consists of a set of related protein databases representing these three spruce species and hybrids studied in the Treenomix project (www.treenomix.ca). Spruce proteome DB is, to our knowledge, the most comprehensive and appropriate sequence resource for studying conifer and other gymnosperm proteomes. 
Spruce proteome DB complements other database resources that provide general information on conifers (e.g., The Gymnosperm Database; http://www.conifers.org/index.html) and conifer genomics (e.g., TreeGenes, http://dendrome.ucdavis.edu/treegenes/).
In addition to the main databases described above, decoy versions of each database have been produced that contain head-to-tail reversed sequences. These decoy databases are provided separately and can be combined with spruce proteome DB to assess the level of false-positive protein identification that is obtained from any proteomics dataset following a database search. The implementation of this approach for the analysis of proteomics data has been previously described (Huttlin et al. 2007). In brief, matches to reverse sequences represent random incorrect matches and the scores that are obtained against these reverse sequences can be used to empirically determine an appropriate score cutoff when interpreting the result of a proteomics database search.
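The decoy approach described above can be illustrated with a minimal Python sketch. It is not the procedure used to build the distributed decoy databases: the `REV_` header prefix, the function names, and the false-discovery estimate (decoy hits divided by target hits at a given cutoff) are assumptions made for illustration. Scores are treated as log(e) expectation values, so lower values indicate better matches.

```python
def reverse_decoys(entries, prefix="REV_"):
    """Return head-to-tail reversed copies of (header, sequence) pairs.

    The "REV_" prefix is a hypothetical naming convention, not one
    documented for spruce proteome DB.
    """
    return [(prefix + header, sequence[::-1]) for header, sequence in entries]


def score_cutoff(target_scores, decoy_scores, max_fdr=0.01):
    """Return the most permissive log(e) cutoff with estimated FDR <= max_fdr.

    Assumption: the false-discovery rate at a cutoff is estimated as
    (number of decoy hits passing) / (number of target hits passing).
    Returns None if no cutoff satisfies the limit.
    """
    best = None
    # Try every observed score as a candidate cutoff, from strict to permissive.
    for cutoff in sorted(set(target_scores) | set(decoy_scores)):
        targets = sum(1 for s in target_scores if s <= cutoff)
        decoys = sum(1 for s in decoy_scores if s <= cutoff)
        if targets and decoys / targets <= max_fdr:
            best = cutoff
    return best
```

In practice, `target_scores` and `decoy_scores` would be taken from the search engine output for matches against the forward and reversed sequences, respectively, and the returned cutoff would then be applied to the forward matches.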
Database implementation and access
Spruce proteome DB can be accessed and used for the interpretation of tandem mass spectrometry data through an instance of the global proteome machine (GPM; Craig and Beavis 2004) at http://treenomix.ca/Home/ResearchActivities/FunctionalGenomics/ProteinProfiling/SpruceDB.aspx (username: tggreview; password: treenomix). Users can upload their peak-extracted data in any GPM-compatible format (e.g., .mgf, .mzxml). The complete spruce proteome DB can also be obtained by direct download for use with other proteomics analysis software platforms. The database is provided in FASTA format and should be compatible with all commercial and open-source platforms. At present, the database has been successfully tested with both Mascot (Perkins et al. 1999) and ProteinPilot (Applied Biosystems, Foster City, CA, USA) as alternative search engines.
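For users who download the database, a quick sanity check before loading it into a search engine is to parse the FASTA file and count its entries. The following Python sketch shows one way to do this; the function name and the file name in the usage note are hypothetical, and the parser assumes plain FASTA with `>` header lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

A downloaded file (here given the placeholder name `spruce_proteome_db.fasta`) could then be counted with `with open("spruce_proteome_db.fasta") as handle: print(sum(1 for _ in read_fasta(handle)))`.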
Table: Performance comparison of spruce proteome DB and NCBInr plant for Norway spruce protein identification from tandem mass spectrometry data
Databases compared (columns): Sitka spruce proteome DB; Plant translated UniGene (NCBI); Sitka spruce UniGene (NCBI)
Measures reported (rows): number of protein sequences in the database; number of spectra submitted; number of proteins identified (log(e) ≤ −3.0)
In its present form, spruce proteome DB provides a resource tailored to the analysis of proteomics data in species of spruce, one of the largest species groups among the conifers, which includes many of the economically and ecologically most important forest tree species of the northern hemisphere. This database may also benefit the analysis of proteomics data from other conifers and gymnosperms. Future versions of this database will deepen spruce proteome coverage as new sequencing efforts are undertaken and will also attempt to gather and process sequences from other conifer and gymnosperm species, expanding the size, the diversity of species included, and the general utility of the database for the analysis of gymnosperm proteomics.
We would like to thank Richard Varhol and An He (Canada’s Michael Smith Genome Sciences Center, Vancouver, British Columbia) for bioinformatics support. The work described in this paper was supported with a grant from the Natural Sciences and Engineering Research Council of Canada (to J.B.) and funding from Genome British Columbia and Genome Canada for the Treenomix Conifer Forest Health project (www.treenomix.ca; to J.B.). J.B. is a University of British Columbia Distinguished University Scholar.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.