Minimal entropy probability paths between genome families

Journal of Mathematical Biology

Abstract.

We develop a metric for probability distributions with applications to biological sequence analysis. Our distance metric is obtained by minimizing a functional defined on the class of paths over probability measures on N categories. The underlying mathematical theory is connected to a constrained problem in the calculus of variations. The solution presented is numerical and approximates the true solution for a class of paths, called rich paths, in which no component of the path is zero. The functional to be minimized is motivated by entropy considerations, reflecting the idea that nature might carry out mutations of genome sequences efficiently, in such a way that the increase in entropy involved in the transformation is as small as possible. We characterize sequences by frequency profiles, or probability vectors; in the case of DNA, N is 4 and the components of the probability vector are the frequencies of occurrence of the bases A, C, G and T. Given two probability vectors a and b, we define a distance function as the infimum of path integrals of the entropy function H(p) over all admissible paths p(t), 0 ≤ t ≤ 1, with p(t) a probability vector such that p(0) = a and p(1) = b. If the probability paths p(t) are parameterized as y(s) in terms of arc length s and the optimal path is smooth with arc length L, then smooth and "rich" optimal probability paths may be estimated numerically by a hybrid method. Newton's method is iterated on solutions of a two-point boundary value problem, with unknown distance L between the abscissas, for the Euler–Lagrange equations resulting from a multiplier rule for the constrained optimization problem, and linear regression is used to improve the arc length estimate L. Matlab code for these numerical methods is provided; it works only for "rich" optimal probability vectors.
These methods motivate a definition of an elementary distance function which is easier and faster to calculate, works on non-rich vectors, and involves neither variational theory nor differential equations, yet is a better approximation of the minimal entropy path distance than the distance ‖b − a‖₂. We compute minimal entropy distance matrices for examples of DNA myostatin genes and amino-acid sequences across several species. Output tree dendrograms for our minimal entropy metric are compared with dendrograms based on BLAST and BLAST identity scores.
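To make the path-integral definition concrete, the sketch below evaluates the entropy integral along one particular admissible path, the straight segment p(t) = a + t(b − a). Because the distance is an infimum over all admissible paths, this value can only be an upper bound on it. This is not the authors' Matlab method and not their elementary distance function; the function names, the use of natural logarithms, the arc-length weighting ‖b − a‖₂ dt, and the midpoint quadrature are all assumptions made for illustration.

```python
import numpy as np


def entropy(p):
    """Shannon entropy H(p) = -sum p_i ln p_i, taking 0 * ln 0 = 0 (nats assumed)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))


def straight_line_entropy_length(a, b, n=1000):
    """Integral of H along the straight path from a to b, by the midpoint rule.

    Hypothetical helper: the segment is only one admissible path, so the
    result bounds the minimal entropy path distance from above.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    t = (np.arange(n) + 0.5) / n  # midpoints of n equal subintervals of [0, 1]
    h = np.array([entropy(a + ti * (b - a)) for ti in t])
    # Arc length element on the segment is ||b - a||_2 dt.
    return float(np.linalg.norm(b - a) * h.mean())


if __name__ == "__main__":
    # Illustrative base-frequency profiles (A, C, G, T); made-up values.
    a = [0.4, 0.1, 0.1, 0.4]
    b = [0.1, 0.4, 0.4, 0.1]
    print(straight_line_entropy_length(a, b))
```

Since H(p) ≤ ln N on the simplex, the value returned is never larger than ln N times ‖b − a‖₂. A path that bends toward lower-entropy regions of the simplex may do better than the segment, which is what the variational problem in the abstract is searching for.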

Author information

Correspondence to Calvin Ahlbrandt.

Additional information

Mathematics Subject Classification (2000): 92B05, 92D20

About this article

Cite this article

Ahlbrandt, C., Benson, G. & Casey, W. Minimal entropy probability paths between genome families. J. Math. Biol. 48, 563–590 (2004). https://doi.org/10.1007/s00285-003-0248-0

  • DOI: https://doi.org/10.1007/s00285-003-0248-0
