Abstract
The composition of naturally occurring DNA sequences is often strikingly heterogeneous. In this paper, the DNA sequence is viewed as a stochastic process whose local compositional properties are determined by the states of a hidden Markov chain. The model is a discrete-state, discrete-outcome version of a general model for non-stationary time series proposed by Kitagawa (1987). A smoothing algorithm is described that can be used to reconstruct the hidden process and to produce graphic displays of the compositional structure of a sequence. Parameter estimation is approached using likelihood methods, and an EM algorithm for approximating the maximum likelihood estimate is derived. The methods are applied to sequences from yeast mitochondrial DNA, human and mouse mitochondrial DNAs, a human X chromosomal fragment and the complete genome of bacteriophage lambda.
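To make the idea of reconstructing the hidden process concrete, the following is a minimal sketch of forward-backward smoothing for a discrete-state, discrete-outcome hidden Markov model. It is not the paper's own implementation: the function name `smooth`, the two-state setup (one AT-rich state, one GC-rich state) and every parameter value are assumptions chosen purely for illustration.

```python
import math

BASES = "ACGT"

def smooth(seq, init, trans, emit):
    """Scaled forward-backward smoothing (illustrative sketch).

    init[k]     : probability of starting in hidden state k
    trans[j][k] : probability of moving from state j to state k
    emit[k][b]  : probability that state k emits base BASES[b]
    Returns (post, loglik), where post[t][k] is the probability of
    state k at position t given the whole sequence.
    """
    obs = [BASES.index(ch) for ch in seq]
    n, m = len(obs), len(init)

    # Forward pass; each row is normalised to avoid numerical underflow.
    alpha, scale = [], []
    for t, o in enumerate(obs):
        if t == 0:
            row = [init[k] * emit[k][o] for k in range(m)]
        else:
            prev = alpha[-1]
            row = [sum(prev[j] * trans[j][k] for j in range(m)) * emit[k][o]
                   for k in range(m)]
        c = sum(row)
        alpha.append([x / c for x in row])
        scale.append(c)

    # Backward pass, reusing the forward scaling constants.
    beta = [[1.0] * m for _ in range(n)]
    for t in range(n - 2, -1, -1):
        o = obs[t + 1]
        beta[t] = [sum(trans[j][k] * emit[k][o] * beta[t + 1][k]
                       for k in range(m)) / scale[t + 1]
                   for j in range(m)]

    # With this scaling, alpha * beta is already the smoothed posterior.
    post = [[alpha[t][k] * beta[t][k] for k in range(m)] for t in range(n)]
    loglik = sum(math.log(c) for c in scale)
    return post, loglik

# Hypothetical parameters: state 0 favours A/T, state 1 favours G/C.
init = [0.5, 0.5]
trans = [[0.95, 0.05], [0.05, 0.95]]
emit = [[0.4, 0.1, 0.1, 0.4],
        [0.1, 0.4, 0.4, 0.1]]
seq = "ATTATAATTA" + "GCGGCCGCGC"
post, loglik = smooth(seq, init, trans, emit)
```

Plotting `post[t][0]` against position `t` would give the kind of graphic display of compositional structure the abstract mentions. The same smoothed probabilities are the quantities an EM iteration would typically use to re-estimate the transition and emission probabilities, although that step is not shown here.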
Literature
Anderson, S., A. T. Bankier, B. G. Barrell, M. H. L. de Bruijn, A. R. Coulson, J. Drouin, I. C. Eperon, D. P. Nierlich, B. A. Roe, F. Sanger, P. H. Schreier, A. J. H. Smith, R. Staden and I. G. Young. 1981. “Sequence and Organization of the Human Mitochondrial Genome.” Nature 290, 457–464.
Baum, L. E., T. Petrie, G. Soules and N. Weiss. 1970. “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains.” Ann. Math. Statist. 41, 164–171.
Becker, R. A. and J. M. Chambers. 1984. S: An Interactive Environment for Data Analysis. Belmont, CA: Wadsworth.
Bernardi, G. and G. Bernardi. 1986. “Compositional Constraints and Genome Evolution.” J. Molec. Evol. 24, 1–11.
Bernardi, G., B. Olofsson, J. Filipski, M. Zerial, G. Cuny, M. Meunier-Rotival and F. Rodier. 1985. “The Mosaic Genome of Warm-Blooded Vertebrates.” Science 228, 953–957.
Bibb, M. J., R. A. Van Etten, C. T. Wright, M. W. Walberg and D. A. Clayton. 1981. “Sequence and Gene Organization of Mouse Mitochondrial DNA.” Cell 26, 167–180.
Blanc, H. and B. Dujon. 1980. “Replicator Regions of the Yeast Mitochondrial DNA Responsible for Suppressiveness.” Proc. Natn. Acad. Sci. U.S.A. 77, 3942–3946.
de Zamaroczy, M. and G. Bernardi. 1986. “The Primary Structure of the Mitochondrial Genome of Saccharomyces cerevisiae: a Review.” Gene 47, 155–177.
Dempster, A. P., N. M. Laird and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” J. R. Statist. Soc. B 39, 1–22.
Elton, R. A. 1974. “Theoretical Models for Heterogeneity of Base Composition in DNA.” J. Theor. Biol. 45, 533–553.
Fangman, W. L. and B. Dujon. 1984. “Yeast Mitochondrial Genomes Consisting of Only AT Base Pairs Replicate and Exhibit Suppressiveness.” Proc. Natn. Acad. Sci. U.S.A. 81, 7156–7160.
Goursot, R., M. Mangin and G. Bernardi. 1982. “Surrogate Origins of Replication in the Mitochondrial Genomes of ori⁰ Petite Mutants of Yeast.” EMBO J. 1, 705–711.
Hinkley, D. V. 1970. “Inference About the Change Point in a Sequence of Random Variables.” Biometrika 57, 1–17.
Kitagawa, G. 1987. “Non-Gaussian State-Space Modeling of Nonstationary Time Series.” J. Am. Statist. Assoc. 82, 1032–1041.
Ott, G. 1967. “Compact Encoding of Stationary Markov Sources.” IEEE Trans. Inf. Theor. IT-13, 82–86.
Riley, D. E., R. Reeves and S. M. Gartler. 1986. “Xrep, a Plasmid-Stimulating X Chromosomal Sequence Bearing Similarities to the BK Virus Replication Origin and Viral Enhancers.” Nucl. Acids Res. 14, 9407–9423.
Sanger, F., A. R. Coulson, G. F. Hong, D. F. Hill and G. B. Petersen. 1982. “Nucleotide Sequence of Bacteriophage λ DNA.” J. Molec. Biol. 162, 729–773.
Schwarz, G. 1978. “Estimating the Dimension of a Model.” Ann. Statist. 6, 461–464.
Skalka, A., E. Burgi and A. D. Hershey. 1968. “Segmental Distribution of Nucleotides in the DNA of Bacteriophage Lambda.” J. Molec. Biol. 34, 1–16.
Smith, A. F. M. 1975. “A Bayesian Approach to Inference About a Change Point in a Sequence of Random Variables.” Biometrika 62, 407–416.
Staden, R. 1984. “Graphic Methods to Determine the Function of Nucleic Acid Sequences.” Nucl. Acids Res. 12, 521–538.
Sueoka, N. 1959. “A Statistical Analysis of Deoxyribonucleic Acid Distribution in Density Gradient Centrifugation.” Proc. Natn. Acad. Sci. U.S.A. 45, 1480–1490.
Cite this article
Churchill, G. A. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology 51, 79–94 (1989). https://doi.org/10.1007/BF02458837