Abstract
The dependence of mutation frequency on local sequence context in the human genome was studied using a set of documented single nucleotide polymorphisms (SNPs) from the 1000 Genomes Project. We considered the development of new computational methods for the statistical analysis of genetic texts based on estimates of sequence complexity, and applied complexity profiles computed in a sliding window to sites containing SNPs in the human genome. A local decrease in text complexity near SNPs was established. Analysis of the complexity profiles in regions containing SNPs showed that flanking monomer repeats determine the decreased context complexity of SNP sites in the human genome. The local decrease in text complexity of sequences flanking SNP sites was confirmed with polymorphism data from the rat and mouse genomes. Coding and regulatory sequences were found to differ in context organization, as reflected in the text complexity of nucleotide sequences containing SNPs. Changes in point mutation frequencies were previously demonstrated for sequences containing microsatellites. Using a more general mathematical apparatus and more complete data, this work demonstrates that the local genomic surroundings of SNPs are enriched in polytracts and simple repeated sequences. Oligonucleotides with increased frequency in the genomic surroundings of human SNPs were identified, and their association with polytracts was demonstrated. The presence of polytracts may indicate a greater probability of a double-strand DNA break at that point, resulting in an increased frequency of nucleotide substitutions.
The estimates were obtained with a previously developed software package that, in addition to estimating the complexity of phased sequences, efficiently determines the frequency spectrum of fixed-length oligonucleotides and compares nucleotide frequencies in a larger sample.
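As an illustration only, the two operations named in the abstract, a fixed-length oligonucleotide frequency spectrum and a complexity profile in a sliding window, can be sketched as follows. The abstract does not state which complexity measure the authors' software uses, so Shannon entropy of the k-mer distribution is assumed here as a simple stand-in. The function names, the window width, and the toy sequence are all hypothetical.

```python
from collections import Counter
from math import log2


def kmer_spectrum(seq: str, k: int) -> Counter:
    """Count every overlapping oligonucleotide of fixed length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy (bits) of the k-mer distribution in seq.

    Assumption: entropy stands in for the complexity measure, which
    the abstract does not specify.
    """
    counts = kmer_spectrum(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


def complexity_profile(seq: str, window: int, k: int = 1) -> list[float]:
    """Entropy of each window of the given width, sliding one base at a time."""
    return [shannon_entropy(seq[i:i + window], k)
            for i in range(len(seq) - window + 1)]


# Hypothetical toy sequence: a poly-A tract flanked by mixed sequence.
toy = "ACGTGCTA" + "A" * 8 + "CGTACGTG"
profile = complexity_profile(toy, window=8)
```

In this toy case the first window contains each of the four bases twice and scores 2 bits, while windows lying entirely inside the poly-A tract score 0 bits. A dip of this kind in the profile is what a flanking monomer repeat would produce under the assumed measure.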
Additional information
Original Russian Text © N.S. Safronova, M.P. Ponomarenko, I.I. Abnizova, G.V. Orlova, I.V. Chadaeva, Y.L. Orlov, 2015, published in Vavilovskii Zhurnal Genetiki i Selektsii, 2015, Vol. 19, No. 6, pp. 668–674.
About this article
Cite this article
Safronova, N.S., Ponomarenko, M.P., Abnizova, I.I. et al. Flanking monomer repeats determine decreased context complexity of single nucleotide polymorphism sites in the human genome. Russ J Genet Appl Res 6, 809–815 (2016). https://doi.org/10.1134/S2079059716070121
DOI: https://doi.org/10.1134/S2079059716070121