
ChemGenome2.1: An Ab Initio Gene Prediction Software

Gene Prediction

Part of the book series: Methods in Molecular Biology (MIMB, volume 1962)

Abstract

Gene prediction, also known as gene identification, gene finding, gene recognition, or gene discovery, is one of the important problems of molecular biology and is receiving increasing attention with the advent of large-scale genome sequencing projects. We designed an ab initio model (called ChemGenome) for gene prediction in prokaryotic genomes based on physicochemical characteristics of codons. In this chapter, we present the methodology of the latest version of this model, ChemGenome2.1 (CG2.1). The first module of the protocol builds a three-dimensional vector from three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for protein-nucleic acid interactions. As this three-dimensional vector moves along a genome, the net orientation of the resultant vector should differ significantly between genic and non-genic regions for a distinction to be feasible. The putative protein-coding genes predicted from the above parameters are passed through a second module of the protocol, which reduces the number of false positives with a filter based on stereochemical properties of protein sequences. The chemical properties of amino acid side chains taken into consideration are the presence of an sp3-hybridized γ carbon atom, hydrogen bond donor ability, shortness or absence of the δ carbon, and linearity of the side chains or non-occurrence of bidentate forks with terminal hydrogen atoms in the side chain. The final prediction of potential protein-coding genes is based on the frequency of occurrence of amino acids in the predicted protein sequences and their deviation from the frequency values of Swissprot protein sequences, at both the monomer and tripeptide levels. The final screening is based on a Z-score.
Though CG2.1 is a gene finding tool for prokaryotes, considering the underlying similarity in the chemical and physical properties of DNA among prokaryotes and eukaryotes, we attempted to evaluate its applicability for gene finding in the lower eukaryotes. The results suggest that gene finding based on a physicochemical model of codons is a viable idea for eukaryotes as well, though improvements are undoubtedly needed.
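As a rough illustration of the two numerical ideas in the abstract (a per-codon three-dimensional vector accumulated along a sequence, and a Z-score on amino acid composition), the following minimal Python sketch may help. It is not the CG2.1 implementation: the function names, the `TOY_TABLE` values, and the use of a summed absolute frequency difference as the deviation measure are all assumptions made for illustration. The real per-codon energies and the Swissprot reference frequencies are not given in this abstract and are not reproduced here.

```python
import math
from collections import Counter

# HYPOTHETICAL placeholder vectors (x, y, z) for three codons only.
# In CG2.1 the components are the base pairing energy, the stacking
# energy, and a protein-nucleic acid interaction index; these toy
# numbers are NOT those values.
TOY_TABLE = {
    "ATG": (1.0, 0.0, 0.0),
    "GCC": (0.0, 1.0, 0.0),
    "TAA": (0.0, 0.0, 1.0),
}

def codons(seq):
    """Split a DNA string into in-frame codons, dropping a trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def resultant_vector(seq, table):
    """Sum the per-codon vectors along the sequence."""
    total = [0.0, 0.0, 0.0]
    for codon in codons(seq):
        for k, component in enumerate(table[codon]):
            total[k] += component
    return tuple(total)

def angle_between(u, v):
    """Angle in degrees between two 3D vectors (orientation comparison)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

def amino_acid_frequencies(protein):
    """Relative frequency of each residue in a protein string."""
    counts = Counter(protein)
    return {aa: n / len(protein) for aa, n in counts.items()}

def composition_deviation(observed, reference):
    """Assumed deviation measure: summed absolute frequency difference."""
    residues = set(observed) | set(reference)
    return sum(abs(observed.get(aa, 0.0) - reference.get(aa, 0.0)) for aa in residues)

def z_score(value, mean, std):
    """Standard score of a value against a background mean and standard deviation."""
    return (value - mean) / std
```

In use, one would compare the orientation of `resultant_vector(orf, table)` for a candidate open reading frame against a reference direction, then screen the translated product by the `z_score` of its `composition_deviation` relative to a background distribution. The table, reference direction, background statistics, and any thresholds would have to come from the chapter itself.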



Author information

Corresponding author

Correspondence to B. Jayaram.

Copyright information

© 2019 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Mishra, A., Siwach, P., Singhal, P., Jayaram, B. (2019). ChemGenome2.1: An Ab Initio Gene Prediction Software. In: Kollmar, M. (eds) Gene Prediction. Methods in Molecular Biology, vol 1962. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9173-0_7

  • DOI: https://doi.org/10.1007/978-1-4939-9173-0_7

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-4939-9172-3

  • Online ISBN: 978-1-4939-9173-0

  • eBook Packages: Springer Protocols
