OMA, A Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNBI, volume 3678)

Abstract

The OMA project is a large-scale effort to identify groups of orthologs from complete genome data, currently covering 150 species. The algorithm relies solely on protein sequence information and does not require any human supervision. It has several original features, in particular a verification step that detects paralogs and prevents them from being clustered together. Consistency checks and verification are performed throughout the process. Wherever a comparison could be made, the resulting groups are highly consistent both with EC assignments and with assignments from the manually curated database HAMAP. A highly accurate set of orthologous sequences constitutes the basis for several other investigations, including phylogenetic analysis and protein classification.
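The idea of keeping paralogs out of a group can be illustrated with a small sketch. It is not the OMA algorithm, which the abstract does not spell out. It assumes that pairwise putative ortholog relations have already been computed by some earlier step, and it uses a simple greedy heuristic in which a protein joins a group only if it is related to every current member. The function name `group_orthologs` and the protein labels are hypothetical.

```python
def group_orthologs(proteins, ortholog_pairs):
    """Greedily build groups in which every pair of members is a putative ortholog.

    Illustrative only: the pairwise relations are assumed to be given, and the
    greedy strategy is a stand-in for whatever clustering a real pipeline uses.
    """
    # Symmetric adjacency map of putative ortholog relations.
    related = {p: set() for p in proteins}
    for a, b in ortholog_pairs:
        related[a].add(b)
        related[b].add(a)

    groups = []
    unassigned = set(proteins)
    # Seed groups from the best-connected proteins first (ties broken by name).
    for seed in sorted(proteins, key=lambda p: (-len(related[p]), p)):
        if seed not in unassigned:
            continue
        group = [seed]
        for candidate in sorted(related[seed] & unassigned):
            # Admit a candidate only if it is related to every current member,
            # so two paralogs (which share no relation) never end up together.
            if all(candidate in related[member] for member in group):
                group.append(candidate)
        unassigned -= set(group)
        groups.append(sorted(group))
    return groups


# Hypothetical example: B1 and B2 are paralogs in species B, so no pair links them.
proteins = ["A1", "B1", "B2", "C1"]
pairs = [("A1", "B1"), ("A1", "C1"), ("B1", "C1"), ("B2", "C1")]
print(group_orthologs(proteins, pairs))
```

In this toy input, `B2` is related to `C1` but not to `A1` or `B1`, so it is left in a group of its own rather than being merged with its paralog `B1`.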

Keywords

  • Vertex Cover
  • Maximal Clique
  • Lateral Gene Transfer
  • Orthologous Relationship
  • Multidomain Protein

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dessimoz, C. et al. (2005). OMA, A Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements. In: McLysaght, A., Huson, D.H. (eds) Comparative Genomics. RCG 2005. Lecture Notes in Computer Science, vol 3678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11554714_6

  • DOI: https://doi.org/10.1007/11554714_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28932-6

  • Online ISBN: 978-3-540-31814-9

  • eBook Packages: Computer Science (R0)