Abstract
A new method for comparing two or more symbol sequences is developed. The method compares the frequencies of short fragments of the sequences; it requires neither string editing nor any other transformation of the compared objects. The comparison is carried out by calculating the specific entropy of a frequency dictionary against a special dictionary, called the hybrid dictionary, which is the statistical ancestor of the group of sequences under comparison. Some applications of the method in genetics and bioinformatics are discussed.
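The comparison described above can be illustrated with a minimal sketch. It is not the author's implementation and rests on several assumptions that the abstract does not spell out: a frequency dictionary is taken to be the relative frequencies of all overlapping fragments of a fixed length q in a linear (not ring-closed) sequence; specific entropy is taken to be the relative entropy (Kullback-Leibler divergence) of one dictionary against another; and the hybrid dictionary is taken to be the one minimizing the summed specific entropy over the group, which under these assumptions is the arithmetic mean of the fragment frequencies. All function names are hypothetical.

```python
from collections import Counter
from math import log


def frequency_dictionary(seq, q):
    """Relative frequencies of all overlapping fragments of length q.

    Assumption: the sequence is read linearly, not closed into a ring.
    """
    if len(seq) < q or q < 1:
        raise ValueError("sequence must contain at least one fragment of length q")
    counts = Counter(seq[i:i + q] for i in range(len(seq) - q + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def hybrid_dictionary(dicts):
    """Assumed hybrid dictionary: arithmetic mean of the given dictionaries.

    The mean minimizes the sum of relative entropies of the dictionaries
    against a common reference, which is the assumption made in this sketch.
    """
    words = set().union(*dicts)
    n = len(dicts)
    return {w: sum(d.get(w, 0.0) for d in dicts) / n for w in words}


def specific_entropy(d, ref):
    """Relative entropy of dictionary d against the reference dictionary ref."""
    # Every word of d has a positive frequency in the mean, so ref[w] > 0.
    return sum(f * log(f / ref[w]) for w, f in d.items() if f > 0)


def compare(seqs, q):
    """Specific entropy of each sequence's dictionary against the hybrid."""
    dicts = [frequency_dictionary(s, q) for s in seqs]
    hybrid = hybrid_dictionary(dicts)
    return [specific_entropy(d, hybrid) for d in dicts]
```

Under these assumptions, `compare(["ACGTACGT", "ACGTACGT"], 2)` yields zero for both sequences, because identical dictionaries coincide with their mean, while `compare(["AAAA", "CCCC"], 2)` yields ln 2 for each, the largest value possible for two sequences that share no fragments. No alignment or editing step is involved at any point; only fragment counts are used.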
Cite this article
Sadovsky, M.G. The method to compare nucleotide sequences based on the minimum entropy principle. Bull. Math. Biol. 65, 309–322 (2003). https://doi.org/10.1016/S0092-8240(02)00107-6