Gene Cluster Prediction and Its Application to Genome Annotation

  • Vikas Rao Pejaver
  • Heewook Lee
  • Sun Kim


Improvements in sequencing technology have made whole-genome sequencing far more accessible to researchers in the life sciences. The volume of genomic sequence data has grown rapidly in recent years, and automated genome-wide function annotation has become a major challenge. The most popular approaches to gene function assignment are based on sequence similarity. However, homology-based methods fail when novel sequences show no significant sequence similarity to known genes. This limitation has led to the exploration of methods that use additional information, such as co-localization, co-evolution and fusion, to assign functions computationally. In prokaryotic genomes, functionally related genes tend to be physically clustered together due to evolutionary pressure, so such gene clusters provide effective clues for gene function assignment. In this chapter, we survey a few of the prominent techniques in this area of research. We also perform simple experiments to detect gene clusters across a given set of genomes. Finally, we present examples from the results of these experiments to show how gene cluster information can be applied to genome annotation and can resolve ambiguities in function assignment.
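
To make the idea of co-localization concrete, the following is a minimal sketch and not the chapter's method. It assumes each genome is given as an ordered list of gene-family labels (for example, ortholog group identifiers), treats genomes as linear rather than circular, and reports pairs of families that lie within a small distance of each other in several genomes. The function name `conserved_gene_pairs`, its parameters `max_gap` and `min_genomes`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def conserved_gene_pairs(genomes, max_gap=1, min_genomes=2):
    """Find gene-family pairs that are co-localized in several genomes.

    genomes     -- dict mapping a genome name to an ordered list of
                   gene-family labels (assumed input format)
    max_gap     -- how many positions ahead a neighbour may lie
                   (1 means strictly adjacent)
    min_genomes -- minimum number of genomes that must share the pair
    """
    support = defaultdict(set)  # unordered pair -> genomes containing it
    for name, genes in genomes.items():
        for i, a in enumerate(genes):
            # Look at the next max_gap genes; gene order and strand are ignored.
            for b in genes[i + 1 : i + 1 + max_gap]:
                if a != b:
                    support[frozenset((a, b))].add(name)
    return {
        pair: names
        for pair, names in support.items()
        if len(names) >= min_genomes
    }


# Toy data (invented for illustration): A, B and C stay close together
# in g1 and g2, while only B and C stay adjacent in g3.
genomes = {
    "g1": ["A", "B", "C", "X", "Y"],
    "g2": ["Y", "C", "B", "A", "Z"],
    "g3": ["A", "Q", "B", "C", "R"],
}
pairs = conserved_gene_pairs(genomes, max_gap=1, min_genomes=2)
for pair, names in sorted(pairs.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), sorted(names))
```

Conserved pairs of this kind could then be merged into larger candidate clusters, and an unannotated gene that repeatedly sits next to genes of known function becomes a candidate for a related function. Real methods must also handle paralogs, circular chromosomes and statistical significance, which this sketch ignores.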


Keywords: Gene Cluster, Gene Pair, Pattern Mining, Conservation Score, Multiple Genome



Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. School of Informatics and Computing, Indiana University, Bloomington, USA
