Abstract
Protein-coding gene annotation errors in prokaryotic genomes continue to accumulate in bioinformatics databases, while in most cases the update rate of genome annotations cannot keep up with the explosive growth of genome sequences. Hence, it is critical to manually rectify genome annotation errors. In this paper, a hybrid strategy that combines ab initio gene prediction programs with programs for re-annotating over-annotated genes is proposed for re-annotation of the protein-coding genes in prokaryotic genomes. Based on this strategy, the protein-coding genes in Geobacter sulfurreducens PCA are comprehensively re-annotated. As a result, 16 hypothetical genes are re-annotated as non-coding sequences and 104 missing genes are recovered as protein-coding genes. Subsequent function analysis and sequence analysis show that the predictions are reliable and robust. Further application to other genomes shows that this work can provide alternative tools for post-processing prokaryotic genome annotations.
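As a rough illustration of how the two directions of re-annotation (recovering missing genes and flagging over-annotated ones) might be set up, the following Python sketch compares an existing annotation with the output of several ab initio predictors. It is not the authors' pipeline: the `Gene` type, the function `compare_annotation`, the tool names, the `min_votes` threshold, and the choice to match genes by strand and stop coordinate are all assumptions made for this example. In particular, the paper applies dedicated re-annotation programs to over-annotated genes, whereas this sketch simply uses the absence of any predictor support as a stand-in.

```python
from collections import Counter
from typing import NamedTuple


class Gene(NamedTuple):
    """A gene keyed by strand and stop coordinate (hypothetical representation)."""
    strand: str  # "+" or "-"
    stop: int    # coordinate of the 3' end


def compare_annotation(annotated, hypothetical, predictions, min_votes=2):
    """Return (candidate_missing, candidate_over_annotated).

    annotated    -- set of Gene in the current annotation
    hypothetical -- subset of `annotated` labelled as hypothetical
    predictions  -- dict mapping tool name -> set of Gene it predicts
    min_votes    -- illustrative threshold, not a value from the paper
    """
    # Count how many predictors report each gene.
    votes = Counter(g for genes in predictions.values() for g in genes)
    # Candidate missing genes: supported by enough tools, absent from the annotation.
    candidate_missing = {g for g, n in votes.items()
                         if n >= min_votes and g not in annotated}
    # Candidate over-annotated genes: hypothetical genes no tool supports.
    candidate_over = {g for g in hypothetical if votes[g] == 0}
    return candidate_missing, candidate_over


# Toy data with made-up coordinates and placeholder tool names.
annotated = {Gene("+", 900), Gene("-", 2400), Gene("+", 3300)}
hypothetical = {Gene("+", 3300)}
predictions = {
    "tool_a": {Gene("+", 900), Gene("-", 2400), Gene("+", 5100)},
    "tool_b": {Gene("+", 900), Gene("+", 5100)},
    "tool_c": {Gene("-", 2400), Gene("+", 7000)},
}
missing, over = compare_annotation(annotated, hypothetical, predictions)
```

In this toy run, the gene ending at 5100 is reported by two tools and is absent from the annotation, so it becomes a candidate missing gene, while the hypothetical gene ending at 3300 has no predictor support and is flagged for review. Candidates from either set would still need the function and sequence analysis that the abstract describes before any annotation is changed.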
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Projects No. 61302186 and No. 61271378), the Shandong Natural Science Foundation (Project No. ZR2010CQ041), and funding from the State Key Laboratory of Bioelectronics of Southeast University.
Conflict of interest
The authors declare that there is no conflict of interest.
Cite this article
Yu, JF., Guo, J., Liu, QB. et al. A hybrid strategy for comprehensive annotation of the protein coding genes in prokaryotic genome. Genes Genom 37, 347–355 (2015). https://doi.org/10.1007/s13258-014-0263-0
DOI: https://doi.org/10.1007/s13258-014-0263-0