Evolutionarily consistent families in SCOP: sequence, structure and function
SCOP is a hierarchical domain classification system for proteins of known structure. The superfamily level has a clear definition: protein domains belong to the same superfamily if there is structural, functional and sequence evidence for a common evolutionary ancestor. Superfamilies are sub-classified into families; however, the family-level groupings have no such clear basis. Do SCOP families group together domains by sequence similarity, by similar structure, or by common function? We address these questions and, most importantly, ask whether each family represents a distinct phylogenetic group within its superfamily.
Several phylogenetic trees were generated for each superfamily: one derived from a multiple sequence alignment, one based on structural distances, and the final two from presence/absence of GO terms or EC numbers assigned to domains. The topologies of the resulting trees and confidence values were compared to the SCOP family classification.
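To make the functional trees more concrete, the following is a minimal sketch of how presence/absence of annotation terms could be turned into pairwise distances suitable for a distance-based tree builder. The choice of Jaccard distance, the function names and the example domains are assumptions made for illustration; the text above does not state which distance measure was used.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(annotations):
    """Build a symmetric distance lookup from {domain: set of terms}."""
    names = sorted(annotations)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(annotations[x], annotations[y])
        dist[(x, y)] = dist[(y, x)] = d
    return names, dist


# Hypothetical domains with illustrative GO term assignments.
annotations = {
    "dom_a": {"GO:0005507", "GO:0009055"},
    "dom_b": {"GO:0005507", "GO:0009055"},
    "dom_c": {"GO:0005507"},
    "dom_d": {"GO:0016491"},
}

names, dist = distance_matrix(annotations)
for x, y in combinations(names, 2):
    print(x, y, dist[(x, y)])
```

Domains with identical term sets end up at distance 0 and domains sharing no terms at distance 1, so a tree built from such a matrix groups domains by shared annotation rather than by sequence or structure.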
We show that SCOP family groupings are evolutionarily consistent to a very high degree with respect to classical sequence phylogenetics. The trees built from (automatically generated) structural distances correlate well, but are not always consistent with SCOP (hand-annotated) groupings. Trees derived from functional data are less consistent with the family level than those from structure or sequence, though the majority still agree. Much of the GO and EC annotation applies directly to one family or to a subset of a family; relatively few terms apply at the superfamily level. Maximum sequence diversity within a family is on average 22% but close to zero for superfamilies.
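One way to read "consistent" is that the domains of a family form a single group in the tree. The sketch below tests this under stated assumptions: the tree is given as nested tuples of leaf names, and a family counts as consistent if its members, or all the other leaves, form one clade (the complement test makes the result independent of where the tree happens to be rooted). This criterion, the function names and the toy tree are hypothetical; the confidence values mentioned above are not modelled.

```python
def clades(tree):
    """Return (all leaves, set of leaf sets for every subtree)."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    return all_leaves, found


def family_consistency(tree, family_of):
    """Map each family to True if its members form one group in the tree."""
    all_leaves, clade_sets = clades(tree)
    families = {}
    for leaf in all_leaves:
        families.setdefault(family_of[leaf], set()).add(leaf)
    result = {}
    for fam, members in families.items():
        members = frozenset(members)
        # Accept the family itself or its complement as a clade, so the
        # answer does not depend on the arbitrary rooting of the tuples.
        result[fam] = (members in clade_sets
                       or (all_leaves - members) in clade_sets)
    return result


# Hypothetical superfamily tree with three families A, B and C.
tree = ((("a1", "a2"), "a3"), (("b1", "c1"), "b2"))
family_of = {"a1": "A", "a2": "A", "a3": "A",
             "b1": "B", "b2": "B", "c1": "C"}

result = family_consistency(tree, family_of)
print(result)
```

In this toy tree family A forms a clade, whereas family B is interrupted by the single member of family C and is therefore reported as inconsistent.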
BMC Structural Biology
- Online Date
- October 2012
- Publisher
- BioMed Central
- Author Affiliations
- 1. Department of Computer Science, University of Bristol, The Merchant Venturers Building, Room 3.16, Woodland Road, Bristol, UK
- 2. Department of Structural Biology, Stanford University School of Medicine, Stanford, 94305, CA, USA