Abstract
An original spectral-statistical approach for detecting latent periodicity in biological sequences is proposed. The approach remains applicable when the statistical sample is small, and it avoids redundancy and instability in identifying the structure of latent periodicity. Practical examples demonstrate that the estimates of periodicity pattern size obtained with this approach for approximate tandem repeats are optimal.
Author information
Additional information
Maria Borisovna Chaley was born in 1963 and graduated from the Moscow Institute of Physics and Technology in 1988 (M.Sc.). She received her PhD in biophysics in 1993 and became a docent in bioinformatics in 2003. At the present time, she is a senior research fellow at the Institute of Mathematical Biology Problems of the Russian Academy of Sciences. Her research interests include bioinformatics, genetic text analysis, and molecular evolution. She is the author of over 40 research publications, including 14 journal articles.
Nafisa Nailovna Nazipova was born in 1960 and graduated from the Department of Computational Mathematics and Cybernetics of Lenin Kazan State University in 1982 (M.Sc.). She received her PhD in physics and mathematics (mathematical modeling, numerical methods, and program complexes) in 2002. At the present time, she is the head of the bioinformatics laboratory at the Institute of Mathematical Biology Problems of the Russian Academy of Sciences. Her research interests include bioinformatics and the structural and functional organization of genetic sequences. She is the author of over 35 research publications, including 9 papers in refereed journals and 2 book chapters.
Vladimir Andreyevich Kutyrkin was born in 1952 and graduated from the Department of Mechanics and Mathematics of Lomonosov Moscow State University in 1974 (M.Sc.). He received his PhD in physics and mathematics in 1995. At the present time, he is a docent at Bauman Moscow State Technical University. His research interests include applied mathematical statistics, computational and discrete mathematics, and bioinformatics. He is the author of over 20 research publications, including 11 journal papers.
About this article
Cite this article
Chaley, M.B., Nazipova, N.N. & Kutyrkin, V.A. Statistical methods for detecting latent periodicity patterns in biological sequences: The case of small-size samples. Pattern Recognit. Image Anal. 19, 358–367 (2009). https://doi.org/10.1134/S1054661809020217
DOI: https://doi.org/10.1134/S1054661809020217