In Silico Characterization of Proteins: UniProt, InterPro and Integr8

  • Review
  • Molecular Biotechnology

Abstract

Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function prediction can be done using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. As well as characterization of proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e. the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.

Acknowledgements

UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U01 HG02712-01. Additional support comes from the European Commission’s grant 021902RII3, from the NIH grants 1R01HG02273-01, HHSN266200400061C, NCI-caBIGICR-10-10-01 and ITR-0205470, and from the Swiss Federal Government. InterPro was funded by grant number QLRI-CT-2000-00517 and in part by grant number QLRI-CT-2001-00015 from the European Union under the RTD programme “Quality of Life and Management of Living Resources”. InterPro is a member database of the MRC-funded eFamily project. Genome Reviews and Integr8 have been or are funded, respectively, by the European Commission’s grants QLRI-CT-2001-00015 and 021902RII3.

Author information

Corresponding author

Correspondence to Nicola Jane Mulder.

About this article

Cite this article

Mulder, N.J., Kersey, P., Pruess, M. et al. In Silico Characterization of Proteins: UniProt, InterPro and Integr8. Mol Biotechnol 38, 165–177 (2008). https://doi.org/10.1007/s12033-007-9003-x

  • DOI: https://doi.org/10.1007/s12033-007-9003-x
