Inferring Function from Homology

Emes, Richard D.

doi:10.1007/978-1-60327-429-6_6

Part of the book series: Methods in Molecular Biology™ (MIMB, volume 453)

Abstract

Modern molecular biology approaches often result in the accumulation of abundant biological sequence data. Ideally, the function of individual proteins predicted using such data would be determined experimentally. However, if a gene of interest has no predictable function or if the amount of data is too large to experimentally assess individual genes, bioinformatics techniques may provide additional information to allow the inference of function.

This chapter proposes a pipeline of freely available Web-based tools to analyze protein-coding DNA sequences of unknown function. Accumulated information obtained during each step of the pipeline is used to build a testable hypothesis of function.

The basis and use of sequence similarity methods for homologue detection are described, with emphasis on BLAST and PSI-BLAST. Annotation of gene function through protein domain detection with SMART and Pfam is discussed, as is the potential for comparison with whole-genome data.
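
As a rough illustration of what a sequence similarity score is, the sketch below computes a Smith-Waterman style local alignment score between two sequences. It is not BLAST or PSI-BLAST, which add heuristic seeding, substitution matrices, and significance statistics on top of this idea. The function name and the match, mismatch, and gap values are illustrative assumptions, not parameters taken from the chapter.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best local alignment score between a and b.

    Toy scoring scheme (assumed for illustration): a fixed reward for
    identical residues, a fixed penalty for mismatches, and a linear
    gap penalty. Real searches use substitution matrices instead.
    """
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # A short query embedded in a longer sequence scores as a full match.
    print(local_alignment_score("MKT", "AAMKTAA"))
```

A higher score indicates a stronger local match under the chosen scheme. A score alone does not establish homology: judging whether a match is likelier to reflect common ancestry than chance requires a statistical assessment of the kind BLAST reports.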

Acknowledgments

Thanks to Pauline Maden and Caleb Webber for reading and commenting on the manuscript. This work was supported by a Medical Research Council UK Bioinformatics Fellowship (G90/112) to R.D.E.

Author information

Authors and Affiliations

Department of Biology, University College London, London, United Kingdom
Richard D. Emes

Editor information

Editors and Affiliations

School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
Jonathan M. Keith PhD

About this protocol

Cite this protocol

Emes, R.D. (2008). Inferring Function from Homology. In: Keith, J.M. (eds) Bioinformatics. Methods in Molecular Biology™, vol 453. Humana Press. https://doi.org/10.1007/978-1-60327-429-6_6

DOI: https://doi.org/10.1007/978-1-60327-429-6_6
Publisher Name: Humana Press
Print ISBN: 978-1-60327-428-9
Online ISBN: 978-1-60327-429-6
eBook Packages: Springer Protocols
