Evolutionarily consistent families in SCOP: sequence, structure and function
SCOP is a hierarchical domain classification system for proteins of known structure. The superfamily level has a clear definition: protein domains belong to the same superfamily if there is structural, functional and sequence evidence for a common evolutionary ancestor. Superfamilies are sub-classified into families, but the basis for the family-level groupings is less clear. Do SCOP families group together domains with similar sequence, with similar structure, or with a common function? We address these questions and, most importantly, ask whether each family represents a distinct phylogenetic group within its superfamily.
Four phylogenetic trees were generated for each superfamily: one derived from a multiple sequence alignment, one based on structural distances, and two from the presence or absence of GO terms and of EC numbers assigned to domains. The topologies of the resulting trees and their confidence values were compared to the SCOP family classification.
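As an illustration of what comparing a tree topology to the family classification can mean, the following is a minimal sketch and not the authors' procedure. It assumes a tree is given as nested Python tuples with domain identifiers as leaves, and it assumes that a family counts as "consistent" when some branch of the tree separates exactly that family's domains from all others. The function names and the example identifiers are hypothetical.

```python
def leaf_sets(tree):
    """Return (all leaves under `tree`, leaf sets of every subtree)."""
    if isinstance(tree, str):
        # A leaf is its own one-member clade.
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def family_is_consistent(tree, family_members):
    """True if one branch splits the family from every other domain.

    Assumption: the tree is treated as unrooted, so either the family
    or its complement may appear as a subtree of the nested tuples.
    """
    all_leaves, clades = leaf_sets(tree)
    members = frozenset(family_members)
    if not members <= all_leaves:
        raise ValueError("family contains domains missing from the tree")
    complement = all_leaves - members
    return members in clades or complement in clades


# Hypothetical superfamily with families A, B and C.
tree = ((("a1", "a2"), "a3"), (("b1", "b2"), "c1"))
print(family_is_consistent(tree, {"a1", "a2", "a3"}))  # family A
print(family_is_consistent(tree, {"b1", "b2"}))        # family B

# A tree in which the members of A and B are interleaved.
mixed = (("a1", "b1"), ("a2", "b2"))
print(family_is_consistent(mixed, {"a1", "a2"}))
```

The same check can be run on each of the trees built for a superfamily, so that agreement with the family classification can be tallied separately per data source.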
We show that SCOP family groupings are evolutionarily consistent to a very high degree with respect to classical sequence phylogenetics. The trees built from automatically generated structural distances correlate well with, but are not always consistent with, the hand-annotated SCOP groupings. Trees derived from functional data are less consistent with the family level than those from structure or sequence, though the majority still agree. Much of the GO and EC annotation applies directly to one family or to a subset of a family; relatively few terms apply at the superfamily level. Maximum sequence diversity within a family is on average 22% but close to zero for superfamilies.
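One way to see at which level an annotation term applies is to count the families in which it occurs. The sketch below is an illustration under stated assumptions, not the analysis used in the paper: it assumes a mapping from domain to family and from domain to a set of annotation terms, and it labels a term "superfamily" only when it occurs in every family. The function name, labels and example identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def term_family_spread(domain_family, domain_terms):
    """Label each term by how many families of a superfamily carry it."""
    families = set(domain_family.values())
    term_to_families = defaultdict(set)
    for domain, terms in domain_terms.items():
        for term in terms:
            term_to_families[term].add(domain_family[domain])

    spread = {}
    for term, fams in term_to_families.items():
        if len(fams) == 1:
            spread[term] = "single family"
        elif fams == families:
            spread[term] = "superfamily"
        else:
            spread[term] = "several families"
    return spread


# Placeholder data: three families, three annotation terms.
domain_family = {"d1": "A", "d2": "A", "d3": "B", "d4": "C"}
domain_terms = {
    "d1": {"term_x", "term_y"},
    "d2": {"term_x"},
    "d3": {"term_y", "term_z"},
    "d4": {"term_y"},
}
for term, label in sorted(term_family_spread(domain_family, domain_terms).items()):
    print(term, label)
```

Terms found in only part of a family are not distinguished here; separating "whole family" from "subset of a family" would need the per-domain counts as well.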
BMC Structural Biology
- Online Date
- October 2012
- Online ISSN
- BioMed Central
- Author Affiliations
- 1. Department of Computer Science, University of Bristol, The Merchant Venturers Building, Room 3.16, Woodland Road, Bristol, UK
- 2. Department of Structural Biology, Stanford University School of Medicine, Stanford, 94305, CA, USA