Abstract
DNA or protein repeats are recurring subsequences in these molecules. These repeats may be adjacent to each other in which case they are called tandem repeats or they may be dispersed named as sequence motifs. Discovery of such subsequences has various implications such as locating genes as they are frequently found near genes, comparing sequences, or disease analysis as the number of repeats is elevated in certain diseases. Instead of searching for exact repeats, we may be interested in finding approximate repeats, as these are encountered more frequently in experiments than exact ones due to mutations in sequences and erroneous measurements. We may search for repeats in a single sequence or a set of sequences. The detected repeats in the latter case provide also the conserved structures in the set which can be used to infer phylogenetic relationships. Discovery of these structures can be performed by combinatorial and probabilistic algorithms as we describe. Graph-based methods involve building of a k-partite similarity graph among k input sequences and then searching for cliques in this graph. A clique found this way will have a vertex in each partition and represent a common motif in all sequences. The distributed algorithms for this purpose are scarce and we propose two new algorithms to detect repeating sequences which can be easily experimented.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Abouelhoda MI, Kurtz S, Ohlebusch E (2002) The enhanced suffix array and its applications to genome analysis. In: Proceedings of WABI 2002, LNCS, vol 2452, pp 449–463. Springer
Bailey TL, Elkan C (1995b) The value of prior knowledge in discovering motifs with MEME. In: Proceedings of thethird international conference on intelligent systems for molecular biology, pp 21–29. AAAI Press
Bailey TL, Elkan C (1995a) Unsupervised leaning of multiple motifs in biopolymers using EM. Mach Learn 21:51–80
Chun-Hsi H, Sanguthevar R (2003) Parallel pattern identification in biological sequences on clusters. IEEE Trans Nanobiosci 2(1):29–34
Crochemore M (1981) An optimal algorithm for computing the repetitions in a word. Inf Process Lett 12(5):244–250
Eskin E, Pevzner PA (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics 18:354–363
Floratos A, Rigoutsos I (1998) On the time complexity of the TEIRESIAS algorithm. In: Research report RC 21161 (94582), IBM T.J. Watson Research Center
Gatchel JR, Zoghbi HY (2005) Diseases of unstable repeat expansion: mechanisms and common principles. Nat Rev Genet 6:743–755
Goldstein DB, Schlotterer C (1999) Microsatellites: evolution and applications, 1st edn. Oxford University Press, ISBN-10: 0198504071, ISBN-13: 978-0198504078
Grundy WN, Bailey TL, Elkan CP (1996) ParaMEME: a parallel implementation and a web interface for a DNA and protein motif discovery tool. Comput Appl Biosci 12(4):303–310
Ikebata H, Yoshida R (2015) Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets. Bioinformatics 1–8: doi:10.1093/bioinformatics/btv017
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262:208–214
Lim KG, Kwoh CK, Hsu LY, Wirawan A (2012) Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance. Brief Bioinform. doi:10.1093/bib/bbs023
Marsan L, Sagot MF (2000) Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification. J Comput Biol 7(3/4):345–360
Matroud A (2013) Nested tandem repeat computation and analysis. Ph.D. Thesis, Massey University
Mejia YP, Olmos I, Gonzalez JA (2010) Structured motifs identification in DNA sequences. In: Proceedings of the twenty-third international florida artificial intelligence research society conference (FLAIRS 2010), pp 44–49
Modan K, Das MK, Dai H-K (2007) A survey of DNA motif finding algorithms. BMC Bioinform 8(Suppl 7):S21
Mohantyr S, Sahu B, Acharya AK (2013) Parallel implementation of exact algorithm for planted (l,d) motif search. In: Proceedings of the international conference on advances in computer science, AETACS
Mourad E, Albert YZ (eds) (2011) Algorithms in computational molecular biology: techniques, approaches and applications. Wiley series in bioinformatics, Chap. 18
Mourad E, Albert YZ (eds) (2011) Algorithms in computational molecular biology: techniques, approaches and applications, pp 386–387. Wiley
Nicolae M, Rajasekaran S (2014) Efficient sequential and parallel algorithms for planted motif search. BMC Bioinform 15:34. doi:10.1186/1471-2105-15-34
Pardalos PM, Rappe J, Resende MGC (1998) An exact parallel algorithm for the maximum clique problem. In: De Leone et al (eds) High performance algorithms and software in nonlinear optimization, vol 24. Kluwer, Dordrecht, pp 279–300
Parson W, Kirchebner R, Muhlmann R, Renner K, Kofler A, Schmidt S, Kofler R (2005) Cancer cell line identification by short tandem repeat profiling: power and limitations. FASEB J 19(3):434–436
Pelotti S, Ceccardi S, Alu M, Lugaresi F, Trane R, Falconi M, Bini C, Cicognani A (2008) Cancerous tissues in forensic genetic analysis. Genet Test 11(4):397–400
Pevzner P, Sze S (2000) Combinatorial approaches to finding subtle signals in DNA sequences. In: Proceedings of the eighth international conference on intelligent systems on molecular biology. San Diego, CA, pp 269–278
Rajasekaran S, Balla S, Huang C-H, Thapar V, Gryk M, Maciejewski M, Schiller M (2005) High-performance exact algorithms for motif search. J Clin Monitor Comput 19:319–328
Rajasekaran S, Balla S, Huang C-H (2005) Exact algorithms for planted motif problems. J Comput Biol 12(8):1117–1128
Sagot MF (1998) Spelling approximate repeated or common motifs using a suffix tree. In: Proceedings of the theoretical informatics conference (Latin98), pp 111–127
Sahli M, Mansour E, Kalnis P (2014) ACME: Efficient parallel motif extraction from very long sequences. VLDB J 23:871–893
Satya RV, Mukherjee A (2004) New algorithms for Finding monad patterns in DNA sequences. In: Proceedings of SPIRE 2004, LNCS, vol 3246, pp 273–285. Springer
Srinivas A (ed) (2005) Handbook of computational molecular biology. Computer and information science series, Chap. 5, December 21, 2005. Chapman & Hall/CRC, Boca Raton
Stoye J, Gusfield D (2002) Simple and flexible detection of contiguous repeats using a suffix tree. Theor Comput Sci 1–2:843–856
Thota S, Balla S, Rajasekaran S (2007) Algorithms for motif discovery based on edit distance. In: Technical report, BECAT/CSE-TR-07-3
Zambelli F, Pesole G, Pavesi G (2012) Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform 14:225–237
Zhang S, Li S, Niu M, Pham PT, Su Z (2011) MotifClick: prediction of cis-regulatory binding sites via merging cliques. BMC Bioinf 12:238
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Erciyes, K. (2015). Sequence Repeats. In: Distributed and Sequential Algorithms for Bioinformatics. Computational Biology, vol 23. Springer, Cham. https://doi.org/10.1007/978-3-319-24966-7_8
Download citation
DOI: https://doi.org/10.1007/978-3-319-24966-7_8
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24964-3
Online ISBN: 978-3-319-24966-7
eBook Packages: Computer ScienceComputer Science (R0)