A hybrid strategy for comprehensive annotation of the protein coding genes in prokaryotic genome

  • Research Article
  • Genes & Genomics

Abstract

Annotation errors in the protein-coding genes of prokaryotic genomes continue to accumulate in bioinformatics databases, and in most cases the rate at which genome annotations are updated cannot keep pace with the explosive growth in genome sequences. It is therefore critical to rectify these annotation errors manually. In this paper, a hybrid strategy that combines ab initio gene prediction programs with programs for re-annotating over-annotated genes is proposed for the re-annotation of protein-coding genes in prokaryotic genomes. Based on this strategy, the protein-coding genes of Geobacter sulfurreducens PCA are comprehensively re-annotated. As a result, 16 hypothetical genes are re-annotated as non-coding sequences and 104 missing genes are recovered as protein-coding genes. Subsequent functional and sequence analyses show that the predictions are reliable and robust. Further application to other genomes shows that this work can provide alternative tools for the post-processing of prokaryotic genome annotations.
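The abstract does not give the decision rules used to combine the two kinds of programs, so the following is only a minimal sketch of how such a hybrid strategy might be organized. Every name in it (`Gene`, `same_gene`, `candidate_missing`, `candidate_over_annotated`, the `min_support` threshold, and the `is_noncoding` callback standing in for a re-annotation program) is a hypothetical introduced for illustration, and the "same stop codon" matching rule and the consensus criterion are assumptions, not the authors' method.

```python
from typing import Callable, List, NamedTuple


class Gene(NamedTuple):
    start: int   # 1-based leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"


def same_gene(a: Gene, b: Gene) -> bool:
    """Assumed matching rule: same strand and same stop-codon (3') end."""
    if a.strand != b.strand:
        return False
    return a.end == b.end if a.strand == "+" else a.start == b.start


def candidate_missing(annotated: List[Gene],
                      predictions: List[List[Gene]],
                      min_support: int = 2) -> List[Gene]:
    """Predicted genes absent from the annotation and reported by at
    least `min_support` ab initio predictors (assumed criterion)."""
    candidates: List[Gene] = []
    for preds in predictions:
        for g in preds:
            if any(same_gene(g, a) for a in annotated):
                continue  # already annotated
            if any(same_gene(g, c) for c in candidates):
                continue  # already collected from another predictor
            support = sum(any(same_gene(g, p) for p in other)
                          for other in predictions)
            if support >= min_support:
                candidates.append(g)
    return candidates


def candidate_over_annotated(hypothetical: List[Gene],
                             predictions: List[List[Gene]],
                             is_noncoding: Callable[[Gene], bool]) -> List[Gene]:
    """Hypothetical genes that no predictor recovers and that the
    supplied re-annotation check also flags as non-coding."""
    return [g for g in hypothetical
            if not any(same_gene(g, p) for preds in predictions for p in preds)
            and is_noncoding(g)]


# Toy coordinates, invented purely for illustration.
annotated = [Gene(100, 400, "+"), Gene(900, 1200, "-")]
predictor_a = [Gene(100, 400, "+"), Gene(1500, 1800, "+")]
predictor_b = [Gene(130, 400, "+"), Gene(1470, 1800, "+"), Gene(2000, 2300, "-")]

missing = candidate_missing(annotated, [predictor_a, predictor_b])
over = candidate_over_annotated([Gene(900, 1200, "-")],
                                [predictor_a, predictor_b],
                                is_noncoding=lambda g: True)
print(missing)
print(over)
```

In this toy run, the gene ending at 1800 is reported by both predictors but is absent from the annotation, so it becomes a candidate missing gene; the annotated gene on the minus strand is recovered by neither predictor and is passed to the stand-in non-coding check. Candidates of either kind would still need the functional and sequence analyses mentioned in the abstract before being accepted.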

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Projects No. 61302186 and No. 61271378), the Shandong Natural Science Foundation (Project No. ZR2010CQ041), and funding from the State Key Laboratory of Bioelectronics of Southeast University.

Conflict of interest

The authors declare that there is no conflict of interest.

Author information

Corresponding authors

Correspondence to Jia-Feng Yu or Xiao Sun.

About this article

Cite this article

Yu, JF., Guo, J., Liu, QB. et al. A hybrid strategy for comprehensive annotation of the protein coding genes in prokaryotic genome. Genes Genom 37, 347–355 (2015). https://doi.org/10.1007/s13258-014-0263-0

  • DOI: https://doi.org/10.1007/s13258-014-0263-0
