In Silico Characterization of Proteins: InterPro and Proteome Analysis

  • Nicola Jane Mulder
  • Manuela Pruess
  • Rolf Apweiler


The main problem we aim to solve in this chapter is the quick and reliable elucidation of protein function and the large-scale analysis of whole proteomes (the protein component of genomes). This problem arose with the advancement of DNA sequencing technologies and the dawning of the genome sequencing era. Previously, unclassified DNA sequences trickled into the public databases from bench scientists working on experimental investigation of the function of the gene products. Currently, however, raw sequences are flooding in with a distinct lack of accompanying annotation, resulting in a requirement for automatic in silico protein sequence analysis tools. Traditionally, scientists use sequence similarity searches to compare a query sequence to those of known function, but this method has its limitations and relies on the quality of existing data. Here we describe improved methods for protein sequence classification using protein signatures.


Keywords: Gene Ontology; Hidden Markov Model; Query Sequence; Regular Expression; Enzyme Commission Number



Copyright information

© Humana Press Inc., Totowa, NJ 2005

Authors and Affiliations

  • Nicola Jane Mulder (1)
  • Manuela Pruess (1)
  • Rolf Apweiler (1)
  1. European Bioinformatics Institute, Cambridge
