Abstract
We introduce a gene evolution model that extends the time-continuous stochastic IDIS model (Lèbre and Michel in Comput. Biol. Chem. 34:259–267, 2010) to sequence length. This new IDISL (Insertion Deletion Independent of Substitution based on sequence Length) model gives an analytical expression for the residue occurrence probability p(l) at sequence length l, depending on stochastically independent processes of substitution, insertion, and deletion. Furthermore, in contrast to all mathematical models in this research field, the substitution, insertion, and deletion parameters of the IDISL model are independent of each other. For any diagonalizable substitution matrix M, the residue occurrence probability p(l) is given as a function of the eigenvalues of M, the eigenvector matrix of M, a vector r of the residue insertion rates, a deletion rate d (unlike our previous IDIS model), and a vector of the initial residue occurrence probability p(l_0) at sequence length l_0.
Unlike classical evolution approaches, which mainly focus on sequence alignment, the IDIS class of models allows a mathematical analysis of the behavior of the residue occurrence probability according to either evolution time or sequence length. The length parameter can be associated with any nucleotide region: genes, genomes, introns, repeats, 5′ and 3′ regions, etc. Three properties of the IDISL model are given in relation to the sequence length l: parameter scale, inverse evolution, and residue equilibrium distribution. Nucleotide occurrence probabilities are given in the particular case of the IDISL-HKY model, i.e. the IDISL model associated with the HKY asymmetric substitution matrix (Hasegawa et al. in J. Mol. Evol. 22:160–174, 1985).
An application of the IDISL model is developed for a massive statistical analysis of GC content in all complete bacterial genomes available to date (894 non-anaerobic and anaerobic genomes). The IDISL-HKY model confirms the increase of the GC content with the genome length for two non-anaerobic taxonomic groups of bacterial genomes. Moreover, the non-linear modelling proposed by the IDISL model outperforms the most recent modelling of GC content in these bacterial genomes (Wang et al. in Biochem. Biophys. Res. Commun. 342:681–684, 2006; Musto et al. in Biochem. Biophys. Res. Commun. 347:1–3, 2006).
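The abstract states that p(l) is expressed through the eigenvalues and eigenvector matrix of a diagonalizable substitution matrix M, with HKY as the particular case. The sketch below illustrates only those two ingredients. It does not reproduce the IDISL expression for p(l), which is not given here. It assumes the standard HKY85 rate-matrix parameterization (equilibrium frequencies `pi` and transition/transversion ratio `kappa`); the function name, the nucleotide order A, C, G, T, and the numerical values are hypothetical choices for illustration, and the paper's own M may be defined or scaled differently.

```python
import numpy as np

def hky_rate_matrix(pi, kappa):
    """Build an HKY85-style rate matrix for nucleotide order A, C, G, T.

    Off-diagonal entry (i, j) is kappa * pi[j] for transitions
    (A<->G, C<->T) and pi[j] for transversions; the diagonal is set
    so that every row sums to zero.
    """
    pi = np.asarray(pi, dtype=float)
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
    Q = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                rate = kappa if (i, j) in transitions else 1.0
                Q[i, j] = rate * pi[j]
        Q[i, i] = -Q[i].sum()  # diagonal is still 0 here, so this is minus the off-diagonal sum
    return Q

# Illustrative values only (assumptions, not estimates from the paper).
pi = [0.2, 0.3, 0.4, 0.1]   # frequencies of A, C, G, T
kappa = 2.0                 # transition/transversion rate ratio

Q = hky_rate_matrix(pi, kappa)

# Eigenvalues and eigenvector matrix: the quantities the abstract
# says p(l) depends on, for a diagonalizable matrix.
eigenvalues, V = np.linalg.eig(Q)

# Diagonalizability check: V * diag(eigenvalues) * V^-1 recovers Q.
Q_rebuilt = (V @ np.diag(eigenvalues) @ np.linalg.inv(V)).real
```

With these frequencies the purine and pyrimidine totals differ (0.6 and 0.4), so the four eigenvalues are distinct, which keeps the numerical decomposition well behaved.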
References
Aldous, D., & Fill, J. A. (2002). Reversible Markov chains and random walks on graphs. Berkeley: University of California.
Arquès, D. G., & Michel, C. J. (1993). Analytical expression of the purine/pyrimidine codon probability after and before random mutations. Bull. Math. Biol., 55, 1025–1038.
Arquès, D. G., & Michel, C. J. (1995). Analytical solutions of the dinucleotide probability after and before random mutations. J. Theor. Biol., 175, 533–544.
Bastolla, U., Moya, A., Viguera, E., & van Ham, R. C. (2004). Genomic determinants of protein folding thermodynamics in prokaryotic organisms. J. Mol. Biol., 343, 1451–1466.
Benard, E., & Michel, C. J. (2009). Computation of direct and inverse mutations with the SEGM web server (Stochastic Evolution of Genetic Motifs): an application to splice sites of human genome introns. Comput. Biol. Chem., 33, 245–252.
Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. London: Chapman & Hall.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17, 368–376.
Felsenstein, J., & Churchill, G. A. (1996). A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol., 13, 93–104.
Foerstner, K. U., von Mering, C., Hooper, S. D., & Bork, P. (2005). Environments shape the nucleotide composition of genomes. EMBO Rep., 6, 1208–1213.
Freese, E. (1962). On the evolution of base composition of DNA. J. Theor. Biol., 3, 82–101.
Giraud, A., Matic, I., Tenaillon, O., Clara, A., Radman, M., Fons, M., & Taddei, F. (2001). Costs and benefits of high mutation rates: adaptive evolution of bacteria in the mouse gut. Science, 291, 2606–2608.
Hasegawa, M., Kishino, H., & Yano, T. (1985). Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol., 22, 160–174.
Jukes, T. H., & Cantor, C. R. (1969). Evolution of protein molecules. In H. N. Munro (Ed.), Mammalian protein metabolism (pp. 21–132). New York: Academic Press.
Kelly, F. P. (1979). Reversibility and stochastic networks. Chichester: Wiley.
Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16, 111–120.
Kimura, M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA, 78, 454–458.
Koonin, E. V., & Wolf, Y. I. (2008). Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.
Lèbre, S., & Michel, C. J. (2010). A stochastic evolution model for residue insertion–deletion independent from substitution. Comput. Biol. Chem., 34, 259–267.
Lee, K. Y., Wahl, R., & Barbu, E. (1956). Contenu en bases puriques et pyrimidiques des acides désoxyribonucléiques des bactéries. Ann. Inst. Pasteur, 91, 212–224.
Malthus, T. R. (2000). An essay on the principle of population. Library of Economics and Liberty, Liberty Fund, Inc.
McGuire, G., Denham, M. C., & Balding, D. J. (2001). Models of sequence evolution for DNA sequences containing gaps. Mol. Biol. Evol., 18, 481–490.
Metzler, D. (2003). Statistical alignment based on fragment insertion and deletion models. Bioinformatics, 19, 490–499.
Michel, C. J. (2007). An analytical model of gene evolution with 9 mutation parameters: an application to the amino acids coded by the common circular code. Bull. Math. Biol., 69, 677–698.
Miklós, I., Lunter, G. A., & Holmes, I. (2004). A “long indel” model for evolutionary sequence alignment. Mol. Biol. Evol., 21, 529–540.
Miklós, I., Novák, A., Satija, R., Lyngsø, R., & Hein, J. (2009). Stochastic models of sequence evolution including insertion–deletion events. Stat. Methods Med. Res., 18, 453–485.
Moran, N. A. (2002). Microbial minimalism: genome reduction in bacterial pathogens. Cell, 108, 583–586.
Musto, H., Naya, H., Zavala, A., Romero, H., Alvarez-Valín, F., & Bernardi, G. (2006). Genomic GC level, optimal growth temperature, and genome size in prokaryotes. Biochem. Biophys. Res. Commun., 347, 1–3.
Rivas, E. (2005). Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinform., 6, 63.
Rivas, E., & Eddy, S. R. (2008). Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput. Biol., 4(9), e1000172.
Rocha, E. P., & Danchin, A. (2002). Base composition bias might result from competition for metabolic resources. Trends Genet., 18, 291–294.
Satapathy, S. S., Dutta, M., & Ray, S. K. (2010). Variable correlation of genome GC% with transfer RNA number as well as with transfer RNA diversity among bacterial groups: α-Proteobacteria and Tenericutes exhibit strong positive correlation. Microbiol. Res., 165, 232–242.
Sueoka, N. (1962). On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci. USA, 48, 582–592.
Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci., 17, 57–86.
Takahata, N., & Kimura, M. (1981). A model of evolutionary base substitutions and its application with special reference to rapid change of pseudogenes. Genetics, 98, 641–657.
Tamura, K., & Nei, M. (1993). Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol., 10, 512–526.
Thorne, J. L., Kishino, H., & Felsenstein, J. (1991). An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol., 33, 114–124.
Thorne, J. L., Kishino, H., & Felsenstein, J. (1992). Inching toward reality: an improved likelihood model of sequence evolution. J. Mol. Evol., 34, 3–16.
Wang, H. C., Susko, E., & Roger, A. J. (2006). On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.
Yang, Z. (1994). Estimating the pattern of nucleotide substitution. J. Mol. Evol., 39, 105–111.
Acknowledgement
We thank the three reviewers for their advice.
Cite this article
Lèbre, S., Michel, C.J. An Evolution Model for Sequence Length Based on Residue Insertion–Deletion Independent of Substitution: An Application to the GC Content in Bacterial Genomes. Bull Math Biol 74, 1764–1788 (2012). https://doi.org/10.1007/s11538-012-9735-z
DOI: https://doi.org/10.1007/s11538-012-9735-z