Abstract
Modern molecular biology approaches often result in the accumulation of abundant biological sequence data. Ideally, the function of individual proteins predicted using such data would be determined experimentally. However, if a gene of interest has no predictable function or if the amount of data is too large to experimentally assess individual genes, bioinformatics techniques may provide additional information to allow the inference of function.
This chapter proposes a pipeline of freely available Web-based tools to analyze protein-coding DNA sequences of unknown function. Accumulated information obtained during each step of the pipeline is used to build a testable hypothesis of function.
The basis and use of sequence similarity methods of homologue detection are described, with emphasis on BLAST and PSI-BLAST. Annotation of gene function through protein domain detection using SMART and Pfam, and the potential for comparison to whole genome data are discussed.
Acknowledgments
Thanks to Pauline Maden and Caleb Webber for reading and commenting on the manuscript. This work was supported by a Medical Research Council UK Bioinformatics Fellowship (G90/112) to R.D.E.
Copyright information
© 2008 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Emes, R.D. (2008). Inferring Function from Homology. In: Keith, J.M. (ed.) Bioinformatics. Methods in Molecular Biology™, vol 453. Humana Press. https://doi.org/10.1007/978-1-60327-429-6_6
DOI: https://doi.org/10.1007/978-1-60327-429-6_6
Publisher Name: Humana Press
Print ISBN: 978-1-60327-428-9
Online ISBN: 978-1-60327-429-6
eBook Packages: Springer Protocols