Molecular Biotechnology, Volume 38, Issue 2, pp 165–177

In Silico Characterization of Proteins: UniProt, InterPro and Integr8

  • Nicola Jane Mulder
  • Paul Kersey
  • Manuela Pruess
  • Rolf Apweiler
Review

Abstract

Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function prediction can be done using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. As well as characterization of proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e. the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.

Keywords

Bioinformatics, Databases, Protein sequences, Protein signatures, Annotation, Proteomics

Notes

Acknowledgements

UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U01 HG02712-01. Additional support comes from the European Commission’s grant 021902RII3, from the NIH grants 1R01HGO2273-01, HHSN266200400061C, NCI-caBIGICR-10-10-01 and ITR-0205470, and from the Swiss Federal Government. InterPro was funded by grant number QLRI-CT-2000–00517 and in part by grant number QLRI-CT-2001000015 from the European Union under the RTD programme “Quality of Life and Management of Living Resources”. InterPro is a member database of the MRC-funded eFamily project. Genome Reviews and Integr8 have been or are funded, respectively, by the European Commission’s grants QLRI-CT-2001000015 and 021902RII3.

Copyright information

© Humana Press Inc. 2007

Authors and Affiliations

  • Nicola Jane Mulder (1)
  • Paul Kersey (1)
  • Manuela Pruess (1)
  • Rolf Apweiler (1)

  1. EMBL Outstation - European Bioinformatics Institute, Hinxton, Cambridge, UK