Abstract
The past decade has seen growing interest in evolutionary models that relax the assumption of site-independent evolution for non-coding sequences. While phylogenetic inference using such so-called context-dependent models is currently computationally prohibitive, these models have been shown to yield significant increases in model fit compared to site-independent evolutionary models, which remain the most widely used models for studying substitution patterns and performing phylogenetic inference. Context-dependent models have also been shown to be well suited to studying the spontaneous deamination of cytosine in mammalian sequences. In this paper, I discuss various approaches presented in recent years to model context-dependent evolution. I start by discussing the empirical research and results that have led to the development of these models. To accurately estimate the context-dependent substitution patterns that arise from these models, accurate sampling of substitution histories under such models is required. Further, appropriate model selection techniques to assess model performance have become more important than ever, given the drastic increase in the number of parameters of context-dependent models and the tendency of older model selection techniques to prefer parameter-rich models. I also present new results on two mammalian datasets (Primate and Laurasiatheria data) to shed light on so-called lineage-dependent context-dependent evolution. I conclude this paper with a discussion of current challenges in the development of context-dependent modeling approaches.
Acknowledgments
I would like to thank Stijn Vansteelandt for helpful discussions regarding calculation of the decorrelation times. The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013)/ERC Grant agreement no. 260864.
Cite this article
Baele, G. Context-Dependent Evolutionary Models for Non-Coding Sequences: An Overview of Several Decades of Research and an Analysis of Laurasiatheria and Primate Evolution. Evol Biol 39, 61–82 (2012). https://doi.org/10.1007/s11692-011-9139-2
DOI: https://doi.org/10.1007/s11692-011-9139-2