A Comparative Genome Annotation System
  • Kwangmin Choi
  • Youngik Yang
  • Sun Kim
Part of the Methods in Molecular Biology™ book series (MIMB, volume 395)


Recent advances in genome sequencing technology and algorithms have made it possible to determine the sequence of a whole genome quickly and cost-effectively. As a result, more than 200 genomes have been completely sequenced. Annotating a genome, however, remains a challenging task. One of the most effective ways to annotate a newly sequenced genome is to compare it with well-annotated, closely related genomes using computational tools and databases. Such comparisons require a number of computational tools and produce a large amount of output that genome annotators must analyze. Because of this difficulty, genome projects are mostly carried out at large genome sequencing centers. To reduce the need for expert knowledge of computational tools and databases, we have developed a web-based genome annotation system called CGAS (a comparative genome annotation system). This chapter describes how to use CGAS and provides the necessary background on the computational tools and resources. As an example, a Bacillus subtilis genome is treated as an unannotated target genome and compared with several reference genomes, including Bacillus halodurans, Oceanobacillus iheyensis HTE831, and Bacillus cereus group genomes (representative strain of Bacillus cereus, Bacillus anthracis).

Key Words

Comparative genomics; genome annotation; Bidirectional Best Hit (BBH); sequence clustering; protein domain; genome context



This research was partially supported by NSF CAREER Award DBI-0237901, INGEN (Indiana Genomics Initiatives), and the AVIDD (Analysis and Visualization of Instrument-Driven Data) Linux cluster.



Copyright information

© Humana Press Inc. 2007

Authors and Affiliations

  • Kwangmin Choi, School of Informatics, Indiana University, Bloomington
  • Youngik Yang, School of Informatics, Indiana University, Bloomington
  • Sun Kim, School of Informatics, Indiana University, Bloomington
