Gene Function Analysis pp 93-108

Part of the Methods in Molecular Biology™ book series (MIMB, volume 408)

Sybil: Methods and Software for Multiple Genome Comparison and Visualization

  • Jonathan Crabtree
  • Samuel V. Angiuoli
  • Jennifer R. Wortman
  • Owen R. White

Abstract

With the successful completion of genome sequencing projects for a variety of model organisms, the selection of candidate organisms for future sequencing efforts has been guided increasingly by a desire to enable comparative genomics. This trend has both depended on and encouraged the development of software tools that can elucidate and capitalize on the similarities and differences between genomes. “Sybil,” one such tool, is a largely web-based software package whose primary goal is to facilitate the analysis and visualization of comparative genome data, with a particular emphasis on protein and gene cluster data. Herein, two techniques are described: a two-phase protein clustering algorithm, used to generate protein clusters suitable for analysis through Sybil, and a method for creating graphical displays of protein or gene clusters that span multiple genomes. When combined, these two relatively simple techniques provide the user of the Sybil software (The Institute for Genomic Research [TIGR] Bioinformatics Department) with a browsable graphical display of his or her “input” genomes, showing which genes are conserved based on the parameters supplied to the protein clustering algorithm. For any given protein cluster, the graphical display consists of a local alignment of the genomes in which the clustered genes are located. The genomes are arranged in a vertical stack, as in a multiple alignment, and shaded areas are used to connect genes in the same cluster, thus displaying conservation at the protein level in the context of the underlying genomic sequences. The authors have found this display (and slight variants thereof) useful for a variety of annotation and comparison tasks, ranging from identifying “missed” gene models or single-exon discrepancies between orthologous genes, to finding large or small regions of conserved gene synteny and investigating the properties of the breakpoints between such regions.
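
The display geometry described above (genomes stacked vertically, with shaded areas joining genes of the same cluster) can be illustrated with a minimal Python sketch. This is not the Sybil implementation, which the abstract does not detail; the `Gene` record, the `connecting_polygons` function, the row height and gap values, and the example coordinates are all hypothetical names and numbers chosen for illustration. The sketch only computes the corner points of each shaded quadrilateral and, as a simplifying assumption, connects genes only between vertically adjacent genome rows.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record: one gene placed on one genome."""
    genome: str
    gene_id: str
    start: int      # left coordinate on the genome
    end: int        # right coordinate on the genome
    cluster: str    # label of the protein cluster the gene belongs to


def connecting_polygons(
    genomes: List[str],
    genes: List[Gene],
    row_height: float = 10.0,   # assumed thickness of one genome row
    row_gap: float = 40.0,      # assumed vertical space between rows
) -> List[Tuple[str, List[Point]]]:
    """Return one quadrilateral per pair of same-cluster genes that sit on
    vertically adjacent genome rows. Genes whose rows are not adjacent are
    left unconnected in this sketch."""
    row_of = {name: i for i, name in enumerate(genomes)}

    # Group genes by (row index, cluster label).
    by_row_cluster: Dict[Tuple[int, str], List[Gene]] = {}
    for gene in genes:
        key = (row_of[gene.genome], gene.cluster)
        by_row_cluster.setdefault(key, []).append(gene)

    polygons: List[Tuple[str, List[Point]]] = []
    for row, cluster in sorted(by_row_cluster):
        upper_genes = by_row_cluster[(row, cluster)]
        lower_genes = by_row_cluster.get((row + 1, cluster), [])
        y_top = row * (row_height + row_gap) + row_height  # bottom edge of upper row
        y_bottom = (row + 1) * (row_height + row_gap)      # top edge of lower row
        for up in upper_genes:
            for low in lower_genes:
                polygons.append((cluster, [
                    (up.start, y_top), (up.end, y_top),
                    (low.end, y_bottom), (low.start, y_bottom),
                ]))
    return polygons


# Example with made-up coordinates: cluster "c1" spans all three genomes,
# cluster "c2" is present only in the first and last genome.
genomes = ["genome_A", "genome_B", "genome_C"]
genes = [
    Gene("genome_A", "a1", 100, 400, "c1"),
    Gene("genome_B", "b1", 150, 450, "c1"),
    Gene("genome_C", "c1g", 120, 430, "c1"),
    Gene("genome_A", "a2", 500, 800, "c2"),
    Gene("genome_C", "c2g", 520, 790, "c2"),
]
polys = connecting_polygons(genomes, genes)
for cluster, points in polys:
    print(cluster, points)
```

Each returned point list could be passed to any drawing backend as a filled polygon. Restricting connections to adjacent rows keeps the shaded areas from crossing intermediate genomes; a cluster missing from a middle genome, such as "c2" here, therefore produces no shading in this sketch.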

Key Words

Bioinformatics; Bioperl; comparative genomics; ortholog; paralog; protein clustering; visualization

Copyright information

© Humana Press Inc. 2007

Authors and Affiliations

  • Jonathan Crabtree (1)
  • Samuel V. Angiuoli (1)
  • Jennifer R. Wortman (1)
  • Owen R. White (1)

  1. The Institute for Genomic Research, Rockville
