In Silico Characterization of Proteins: UniProt, InterPro and Integr8

Mulder, Nicola Jane; Kersey, Paul; Pruess, Manuela; Apweiler, Rolf

doi:10.1007/s12033-007-9003-x

Review
Published: 04 October 2007

Molecular Biotechnology, Volume 38, pages 165–177 (2008)

Abstract

Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function prediction can be done using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. As well as characterization of proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e. the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.
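
To make the idea of a protein signature concrete, the following is a minimal sketch of the simplest kind of signature, a PROSITE-style pattern, translated into a regular expression and scanned against a sequence. The function names and the test sequence are hypothetical and chosen only for illustration; the pattern N-{P}-[ST]-{P} is the well-known PROSITE N-glycosylation site motif. Profile and hidden Markov model signatures work differently and are not covered by this sketch.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; handles the common syntax only:
    x (any residue), [..] (allowed), {..} (forbidden), (n) / (n,m) repeats,
    and the < and > anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence anchors
        parts.append(element)
    return "".join(parts)


def find_signature(sequence: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched residues) for every hit, overlaps included."""
    # A lookahead with a capturing group reports overlapping matches too.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Made-up sequence; not taken from any database entry.
    hits = find_signature("MKNGTAANPSQNVSW", "N-{P}-[ST]-{P}")
    for start, residues in hits:
        print(start, residues)
```

A hit from a short pattern like this is only weak evidence of function, which is one reason integrated resources combine several signature methods rather than relying on a single one.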

Acknowledgements

UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U01 HG02712-01. Additional support comes from the European Commission’s grant 021902RII3, from the NIH grants 1R01HG02273-01, HHSN266200400061C, NCI-caBIG-ICR-10-10-01 and ITR-0205470, and from the Swiss Federal Government. InterPro was funded by grant number QLRI-CT-2000-00517 and in part by grant number QLRI-CT-2001-00015 from the European Union under the RTD programme “Quality of Life and Management of Living Resources”. InterPro is a member database of the MRC-funded eFamily project. Genome Reviews and Integr8 have been, or are currently, funded by the European Commission’s grants QLRI-CT-2001-00015 and 021902RII3, respectively.

Author information

Authors and Affiliations

EMBL Outstation - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Nicola Jane Mulder, Paul Kersey, Manuela Pruess & Rolf Apweiler

Corresponding author

Correspondence to Nicola Jane Mulder.

About this article

Cite this article

Mulder, N.J., Kersey, P., Pruess, M. et al. In Silico Characterization of Proteins: UniProt, InterPro and Integr8. Mol Biotechnol 38, 165–177 (2008). https://doi.org/10.1007/s12033-007-9003-x

Received: 31 August 2007
Accepted: 31 August 2007
Published: 04 October 2007
Issue Date: February 2008
DOI: https://doi.org/10.1007/s12033-007-9003-x
