# Minimal entropy probability paths between genome families


## Abstract

We develop a metric for probability distributions with applications to biological sequence analysis. Our distance metric is obtained by minimizing a functional defined on the class of paths over probability measures on *N* categories. The underlying mathematical theory is connected to a constrained problem in the calculus of variations. The solution presented is numerical and approximates the true solution for a class of cases called *rich paths*, in which no component of the path is zero. The functional to be minimized is motivated by entropy considerations, reflecting the idea that nature might efficiently carry out mutations of genome sequences in such a way that the increase in entropy involved in the transformation is as small as possible. We characterize sequences by frequency profiles, or probability vectors; for DNA, *N* is 4 and the components of the probability vector are the frequencies of occurrence of the bases A, C, G and T. Given two probability vectors **a** and **b**, we define a distance function as the infimum of path integrals of the entropy function *H*(*p*) over all admissible paths *p*(*t*), 0 ≤ *t* ≤ 1, with *p*(*t*) a probability vector such that *p*(0) = **a** and *p*(1) = **b**. If the probability paths *p*(*t*) are parameterized as *y*(*s*) in terms of arc length *s* and the optimal path is smooth with arc length *L*, then smooth and “rich” optimal probability paths may be numerically estimated by a hybrid method. This method iterates Newton’s method on solutions of a two-point boundary value problem, with unknown distance *L* between the abscissas, for the Euler–Lagrange equations resulting from a multiplier rule for the constrained optimization problem, and uses linear regression to improve the arc length estimate *L*. Matlab code for these numerical methods is provided; it works only for “rich” optimal probability vectors.
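Written out, the definition above might read as follows. This is a sketch under two assumptions that the abstract does not state: that *H* is the Shannon entropy, and that the path integral is taken with respect to Euclidean arc length. The symbol *d* for the distance is introduced here only for illustration.

$$
H(p) = -\sum_{i=1}^{N} p_i \log p_i,
\qquad
d(\mathbf{a}, \mathbf{b}) = \inf_{p} \int_0^1 H\bigl(p(t)\bigr)\, \lVert p'(t) \rVert_2 \, dt,
$$

where the infimum runs over admissible paths with $p(0) = \mathbf{a}$, $p(1) = \mathbf{b}$, $p_i(t) \ge 0$ and $\sum_{i=1}^{N} p_i(t) = 1$ for all $t$. Under the arc length parameterization $y(s)$, with $\lVert y'(s) \rVert_2 = 1$ and total length $L$, the same integral becomes

$$
\int_0^L H\bigl(y(s)\bigr)\, ds .
$$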
These methods motivate a definition of an elementary distance function which is easier and faster to calculate, works on non-rich vectors, and involves neither variational theory nor differential equations, yet is a better approximation of the minimal entropy path distance than the distance ‖**b** − **a**‖₂. We compute minimal entropy distance matrices for examples of DNA myostatin genes and amino-acid sequences across several species. The resulting dendrograms for our minimal entropy metric are compared with dendrograms based on BLAST and BLAST identity scores.
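The frequency profiles and the entropy path integral described above can be illustrated with a minimal Python sketch. It is neither the Matlab code mentioned in the abstract nor the elementary distance function, whose definition the abstract does not give. It assumes that *H* is the Shannon entropy (natural logarithm) and that the path integral is taken with respect to Euclidean arc length. It evaluates the integral only along the straight segment from **a** to **b**, which is one admissible path and therefore gives an upper bound on the infimum, not the infimum itself. All function names are hypothetical.

```python
import math
from collections import Counter


def frequency_profile(seq, alphabet="ACGT"):
    """Return the relative frequency of each alphabet symbol in seq."""
    # Symbols outside the alphabet (e.g. N or gaps) are ignored.
    counts = Counter(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    return [counts[ch] / total for ch in alphabet]


def shannon_entropy(p):
    """Shannon entropy in nats, with the convention 0 * log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def straight_line_entropy_integral(a, b, steps=1000):
    """Midpoint-rule estimate of the integral of H along the segment a -> b.

    The integral is taken with respect to Euclidean arc length (an
    assumption), so it equals ||b - a||_2 times the mean entropy on the path.
    """
    length = math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint of the k-th subinterval of [0, 1]
        point = [x + t * (y - x) for x, y in zip(a, b)]
        total += shannon_entropy(point)
    return length * total / steps


# Toy sequences, made up for illustration only.
a = frequency_profile("AACCGGTTAC")
b = frequency_profile("AAAAAACGTT")
print("profile a:", a)
print("profile b:", b)
print("straight-line entropy integral:", straight_line_entropy_integral(a, b))
```

Because the segment between two probability vectors stays inside the simplex, every sampled point is itself a probability vector, and the routine also accepts non-rich vectors with zero components. Finding the actual minimizing path requires the variational machinery described in the abstract.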

### Keywords

ACGT sequences, Entropy, Probability vectors, Probability paths, Distance between genome families, Constrained variational problems, Euler–Lagrange multiplier rules
