Abstract
A new method to identify all sufficiently long repeating nucleotide substrings in one or several DNA sequences is proposed. The method based on a specific gauge applied to DNA sequences that guarantees identification of the repeating substrings. The method allows the matching substrings to contain a given level of errors. The gauge is based on the development of a heavily sparse dictionary of repeats, thus drastically accelerating the search procedure. Some biological applications illustrate the method.
References
Altschul, S.F., Gish, W., Miller, W., Meyers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
Bergroth, L., Hakonen, H., Raita, T.: A survey of longest common subsequence algorithms. In: SPIRE 2000, pp. 39–48. IEEE Computer Society (2000)
Chen, G.L., Chang, Y.J., Hsueh, C.H.: PRAP: an ab initio software package for automated genome-wide analysis of DNA repeats for prokaryotes. Bioinformatics. 29(21), 2683–2689 (2013)
Erciyes, K.: Distributed and Sequential Algorithms for Bioinformatics. Computational Biology, vol. 23, p. 367. Springer, Cham (2015)
Girgis, H.Z.: Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinform. 16, 227 (2015)
Hirschberg, D.S.: A linear space algorithm for computing maximal common subsequences. Commun. ACM 18(6), 341–343 (1975)
Lian, S., Chen, X., Wang, P., Zhang, X., Dai, X.: A complete and accurate Ab initio repeat finding algorithm. Interdiscip. Sci. 8(1), 75–83 (2016)
Maier, D.: The complexity of some problems on subsequences and supersequences. J. ACM 25(2), 322–336 (1978). ACM Press
Masek, W.J., Paterson, M.S.: A faster algorithm computing string edit distances. J. Comput. Syst. Sci. 20(1), 18–31 (1980)
Novák, P., Neumann, P., Pech, J., Steinhaisl, J., Macas, J.: RepeatExplorer: a galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next-generation sequence reads. Bioinformatics 29(6), 792–793 (2013)
Pearson, W., Lipman, D.: Improved tools for biological sequence comparison. PNAS USA 85, 2444–2448 (1988)
Sadovsky, M.G.: Information capacity of nucleotide sequences and its applications. Bull. Math. Biology. 68, 156 (2006)
Acknowledgments
This study was supported by a research grant No. 14.Y26.31.0004 from the Government of the Russian Federation (M.G. Sadovsky) and the grant from Russian Ministry of Education and Science to Siberian Federal University, contract No. 1.1462.2014/K (S.P. Tsarev).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Tsarev, S.P., Sadovsky, M.G. (2016). New Error Tolerant Method for Search of Long Repeats in DNA Sequences. In: Botón-Fernández, M., MartÃn-Vide, C., Santander-Jiménez, S., Vega-RodrÃguez, M.A. (eds) Algorithms for Computational Biology. AlCoB 2016. Lecture Notes in Computer Science(), vol 9702. Springer, Cham. https://doi.org/10.1007/978-3-319-38827-4_14
Download citation
DOI: https://doi.org/10.1007/978-3-319-38827-4_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-38826-7
Online ISBN: 978-3-319-38827-4
eBook Packages: Computer ScienceComputer Science (R0)