ChemGenome2.1: An Ab Initio Gene Prediction Software

Mishra, Akhilesh; Siwach, Priyanka; Singhal, Poonam; Jayaram, B.

doi:10.1007/978-1-4939-9173-0_7

Part of the book series: Methods in Molecular Biology (MIMB, volume 1962)

Abstract

Gene prediction, also known as gene identification, gene finding, gene recognition, or gene discovery, is one of the important problems of molecular biology and is receiving increasing attention owing to the advent of large-scale genome sequencing projects. We designed an ab initio model (called ChemGenome) for gene prediction in prokaryotic genomes based on the physicochemical characteristics of codons. In this chapter, we present the methodology of the latest version of this model, ChemGenome2.1 (CG2.1). The first module of the protocol builds a three-dimensional vector from three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for protein-nucleic acid interactions. As this three-dimensional vector moves along a genome, the net orientation of the resultant vector should differ significantly between genic and non-genic regions, which makes the distinction feasible. The putative protein-coding genes predicted from these parameters are passed to a second module of the protocol, which reduces the number of false positives by applying a filter based on stereochemical properties of protein sequences. The chemical properties of amino acid side chains taken into consideration are the presence of an sp3-hybridized γ carbon atom, hydrogen bond donor ability, a short or absent δ carbon, and linearity of the side chains, that is, non-occurrence of bidentate forks with terminal hydrogen atoms in the side chain. The final prediction of potential protein-coding genes is based on the frequency of occurrence of amino acids in the predicted protein sequences and their deviation from the frequency values of Swissprot protein sequences, at both the monomer and tripeptide levels. The final screening is based on the Z-score.
Although CG2.1 is a gene-finding tool for prokaryotes, the underlying similarity in the chemical and physical properties of DNA between prokaryotes and eukaryotes led us to evaluate its applicability to gene finding in lower eukaryotes. The results suggest that gene finding based on a physicochemical model of codons is viable for eukaryotes as well, though improvements are undoubtedly needed.
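
The two computational ideas named in the abstract, a resultant codon vector and a Z-score screen on amino acid composition, can be illustrated with a minimal Python sketch. This is not the CG2.1 implementation. The parameter table `CODON_PARAMS` holds made-up placeholder numbers, and the function names, the Euclidean deviation measure, and the way the Z-score is formed are all assumptions introduced here for illustration; the abstract does not specify them. Only the monomer level is shown.

```python
import math
from collections import Counter

# HYPOTHETICAL per-codon values, in the order named in the abstract:
# (base pairing energy, stacking energy, protein-nucleic acid interaction index).
# These numbers are placeholders, not the published CG2.1 parameters.
CODON_PARAMS = {
    "ATG": (0.2, -0.5, 0.7),
    "GCT": (0.6, 0.1, 0.3),
    "AAA": (-0.8, -0.4, 0.1),
    "TAA": (-0.9, -0.6, -0.7),
}


def resultant_vector(seq, params=CODON_PARAMS):
    """Sum the 3-D codon vectors along one reading frame; return the unit resultant."""
    total = [0.0, 0.0, 0.0]
    for i in range(0, len(seq) - 2, 3):
        vec = params.get(seq[i:i + 3].upper())
        if vec is None:
            continue  # codon missing from the toy table: skipped in this sketch
        total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return tuple(t / norm for t in total) if norm else (0.0, 0.0, 0.0)


def composition_deviation(protein, reference_freq):
    """Euclidean distance between observed and reference amino acid frequencies.

    The distance measure is an assumption; the abstract only says 'deviation'.
    """
    if not protein:
        raise ValueError("protein sequence must not be empty")
    counts = Counter(protein)
    n = len(protein)
    return math.sqrt(
        sum((counts.get(aa, 0) / n - f) ** 2 for aa, f in reference_freq.items())
    )


def z_score(value, reference_values):
    """Standard Z-score of `value` against a reference sample (population SD)."""
    mean = sum(reference_values) / len(reference_values)
    var = sum((x - mean) ** 2 for x in reference_values) / len(reference_values)
    sd = math.sqrt(var)
    return (value - mean) / sd if sd else 0.0
```

In such a scheme, the orientation returned by `resultant_vector` would be compared between candidate genic and non-genic stretches, and a candidate protein whose `composition_deviation` gives an extreme `z_score` relative to a reference set would be screened out. The thresholds, and the tripeptide-level analogue, are left unspecified here because the abstract does not give them.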

Author information

Authors and Affiliations

Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
Akhilesh Mishra, Priyanka Siwach, Poonam Singhal & B. Jayaram
Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India
Akhilesh Mishra & B. Jayaram
Department of Biotechnology, Chaudhary Devi Lal University, Sirsa, Haryana, India
Priyanka Siwach
Department of Chemistry, Indian Institute of Technology Delhi, New Delhi, India
B. Jayaram

Corresponding author

Correspondence to B. Jayaram.

Editor information

Editors and Affiliations

Group Systems Biology of Motor Proteins, Department NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Goettingen, Germany
Martin Kollmar

Cite this protocol

Mishra, A., Siwach, P., Singhal, P., Jayaram, B. (2019). ChemGenome2.1: An Ab Initio Gene Prediction Software. In: Kollmar, M. (eds) Gene Prediction. Methods in Molecular Biology, vol 1962. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9173-0_7

DOI: https://doi.org/10.1007/978-1-4939-9173-0_7
Published: 25 April 2019
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9172-3
Online ISBN: 978-1-4939-9173-0
eBook Packages: Springer Protocols
