Inferring Function from Homology

  • Protocol
Bioinformatics

Part of the book series: Methods in Molecular Biology™ (MIMB, volume 453)

Abstract

Modern molecular biology approaches often result in the accumulation of abundant biological sequence data. Ideally, the function of individual proteins predicted using such data would be determined experimentally. However, if a gene of interest has no predictable function or if the amount of data is too large to experimentally assess individual genes, bioinformatics techniques may provide additional information to allow the inference of function.

This chapter proposes a pipeline of freely available Web-based tools to analyze protein-coding DNA sequences of unknown function. Accumulated information obtained during each step of the pipeline is used to build a testable hypothesis of function.
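
As an illustration of a first step that such a pipeline typically needs, the sketch below translates a protein-coding DNA sequence into a protein sequence that could then be submitted to protein-level search tools. It is not taken from the chapter: the function name `translate` and the example sequence are hypothetical, and the sketch assumes the standard genetic code and that the coding sequence starts in frame at position 0.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varies slowest, third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Hypothetical helper for illustration only. Codons containing
    ambiguous bases (for example N) are reported as 'X'.
    """
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))  # prints MAK
```

A real sequence of unknown function may need all six reading frames checked before the protein is searched against a database; this sketch covers only the single forward frame.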

The basis and use of sequence similarity methods of homologue detection are described, with emphasis on BLAST and PSI-BLAST. Annotation of gene function through protein domain detection using SMART and Pfam, and the potential for comparison to whole genome data are discussed.
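
To make the idea of a sequence similarity score concrete, the sketch below computes an optimal local alignment score by dynamic programming (the Smith-Waterman approach). It is a toy illustration rather than the BLAST algorithm: BLAST is a faster heuristic, and real protein searches use substitution matrices and affine gap penalties. The function name and the match, mismatch, and gap values here are assumptions chosen for readability.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only: identical residues score `match`,
    different residues score `mismatch`, and each gap position
    costs `gap` (a simple linear gap penalty).
    """
    previous_row = [0] * (len(b) + 1)
    best = 0
    for residue_a in a:
        current_row = [0]
        for j, residue_b in enumerate(b, start=1):
            substitution = match if residue_a == residue_b else mismatch
            score = max(
                0,                                  # start a new local alignment
                previous_row[j - 1] + substitution, # align the two residues
                previous_row[j] + gap,              # gap in sequence b
                current_row[j - 1] + gap,           # gap in sequence a
            )
            current_row.append(score)
            best = max(best, score)
        previous_row = current_row
    return best


# Two sequences sharing the short motif "MAK" score 3 matches x 2 = 6.
print(local_alignment_score("GGMAKGG", "TTMAKTT"))  # prints 6
```

A raw score like this says nothing about significance on its own; search tools such as BLAST additionally report statistics that estimate how often a score that high would arise by chance in a database of a given size.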

Acknowledgments

Thanks to Pauline Maden and Caleb Webber for reading and commenting on the manuscript. This work was supported by a Medical Research Council UK Bioinformatics Fellowship (G90/112) to R.D.E.

Copyright information

© 2008 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Emes, R.D. (2008). Inferring Function from Homology. In: Keith, J.M. (ed.) Bioinformatics. Methods in Molecular Biology™, vol 453. Humana Press. https://doi.org/10.1007/978-1-60327-429-6_6

  • DOI: https://doi.org/10.1007/978-1-60327-429-6_6

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-60327-428-9

  • Online ISBN: 978-1-60327-429-6

  • eBook Packages: Springer Protocols
