Abstract
Gene prediction, also known as gene identification, gene finding, gene recognition, or gene discovery, is one of the important problems of molecular biology and is receiving increasing attention due to the advent of large-scale genome sequencing projects. We designed an ab initio model (called ChemGenome) for gene prediction in prokaryotic genomes based on physicochemical characteristics of codons. In this chapter, we present the methodology of the latest version of this model, ChemGenome2.1 (CG2.1). The first module of the protocol builds a three-dimensional vector from three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for protein-nucleic acid interactions. As this three-dimensional vector moves along a genome, the net orientation of the resultant vector should differ significantly between genic and non-genic regions for a distinction to be feasible. The putative protein-coding genes predicted from the above parameters are passed through a second module of the protocol, which reduces the number of false positives by applying a filter based on stereochemical properties of protein sequences. The chemical properties of amino acid side chains taken into consideration are the presence of an sp3-hybridized γ carbon atom, hydrogen bond donor ability, a short or absent δ carbon, and linearity of the side chain or non-occurrence of bidentate forks with terminal hydrogen atoms in the side chain. The final prediction of the potential protein-coding genes is based on the frequency of occurrence of amino acids in the predicted protein sequences and their deviation from the frequency values of Swiss-Prot protein sequences, at both the monomer and tripeptide levels. The final screening is based on a Z-score.
Though CG2.1 is a gene finding tool for prokaryotes, the underlying similarity in the chemical and physical properties of DNA among prokaryotes and eukaryotes led us to evaluate its applicability to gene finding in the lower eukaryotes. The results give hope that gene finding based on a physicochemical model of codons is a viable idea for eukaryotes as well, though improvements are undoubtedly needed.
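The two quantitative ideas in the abstract, a resultant codon vector and a Z-score screen on amino acid composition, can be illustrated with a minimal Python sketch. It is not the CG2.1 implementation. The per-codon values produced by `toy_codon_vector` are hypothetical placeholders, not the calculated energies and interaction index used by ChemGenome, and the reference frequencies passed to `composition_deviation` stand in for Swiss-Prot frequencies that are not given here. All function names are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product
from statistics import mean, pstdev

BASES = "ACGT"


def toy_codon_vector(codon):
    """HYPOTHETICAL placeholder for the three per-codon quantities.

    These numbers are invented for illustration; they are NOT the
    base pairing energy, stacking energy, or interaction index of CG2.1.
    """
    gc = sum(base in "GC" for base in codon)
    purines = sum(base in "AG" for base in codon)
    x = -(2.0 * (3 - gc) + 3.0 * gc)       # stand-in for base pairing energy
    y = -(1.0 + 0.5 * purines)             # stand-in for stacking energy
    z = 1.0 if codon[2] in "GC" else -1.0  # stand-in for interaction index
    return (x, y, z)


# Lookup table covering all 64 codons.
CODON_VECTORS = {
    "".join(c): toy_codon_vector("".join(c)) for c in product(BASES, repeat=3)
}


def resultant_unit_vector(sequence, frame=0):
    """Sum the codon vectors along one reading frame and normalise the sum."""
    seq = sequence.upper()
    total = [0.0, 0.0, 0.0]
    for i in range(frame, len(seq) - 2, 3):
        vec = CODON_VECTORS.get(seq[i:i + 3])
        if vec is None:  # skip codons containing ambiguous bases such as N
            continue
        for k in range(3):
            total[k] += vec[k]
    norm = math.sqrt(sum(c * c for c in total))
    if norm == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(c / norm for c in total)


def composition_deviation(protein, reference_freqs):
    """Sum of absolute differences between observed and reference frequencies."""
    if not protein:
        return 0.0
    counts = Counter(protein)
    n = len(protein)
    return sum(abs(counts.get(aa, 0) / n - f) for aa, f in reference_freqs.items())


def z_scores(values):
    """Standard Z-scores (population standard deviation) of a list of numbers."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0.0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]
```

In use, one would compare `resultant_unit_vector` across candidate open reading frames and flanking regions, then compute `composition_deviation` for each translated candidate and screen on the resulting `z_scores`. The decision thresholds, like the parameter values, would have to come from the published method.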
© 2019 Springer Science+Business Media, LLC, part of Springer Nature
Mishra, A., Siwach, P., Singhal, P., Jayaram, B. (2019). ChemGenome2.1: An Ab Initio Gene Prediction Software. In: Kollmar, M. (eds) Gene Prediction. Methods in Molecular Biology, vol 1962. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9173-0_7
DOI: https://doi.org/10.1007/978-1-4939-9173-0_7
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9172-3
Online ISBN: 978-1-4939-9173-0
eBook Packages: Springer Protocols