To Detect and Analyze Sequence Repeats Whatever Be Their Origin

Nicolas, Jacques

doi:10.1007/978-1-61779-603-6_4

Part of the book series: Methods in Molecular Biology (MIMB, volume 859)

Abstract

The development of numerous programs for identifying mobile elements raises the question of which founding concepts their designs share. Clarifying these concepts is necessary for at least three reasons. First, the cost of designing, developing, debugging, and maintaining software risks distracting biologists from their main bioanalysis tasks, which are already demanding. A few key concepts on exact repeats always underlie the search for genomic repeats, and we recall the most important ones. Throughout the chapter, we try to select practical tools that may help in designing new identification pipelines. Second, the huge increase in sequence production capacity requires the most efficient data structures and algorithms so that tools can scale to the data deluge. This chapter provides an up-to-date glimpse of the art of string indexing and string matching. Third, there is a growing body of knowledge on the architecture of mobile elements, built from the literature and from the analysis of results generated by these pipelines. Beyond data management, which has led to the discovery of new families or new elements of a family, the community increasingly needs knowledge management tools to compare, validate, or simply keep track of mobile element models. We end the chapter with first considerations on what could help such research on models in the near future.

Notes

1. If SA denotes the suffix array of sequence s, then BWt[i], the ith letter of the BWt, is s[(SA[i] − 1) mod |s|]. In our example, the BWt corresponds to the string TTTACTTTCGTG.
2. Namely, Cost(Indel) = Cost(Mismatch) − Cost(Match)/2.
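
The relation stated in note 1 between a suffix array and the BWt can be illustrated with a minimal Python sketch. The function names, the naive sorting-based suffix array construction, and the input string `banana$` (with `$` as an end-of-sequence sentinel) are assumptions chosen for illustration; they are not the chapter's example or implementation.

```python
def suffix_array(s):
    """Naive suffix array: start positions of the suffixes of s in
    lexicographic order. Quadratic-ish cost, for illustration only."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_suffix_array(s, sa):
    """Apply the relation of note 1: BWt[i] = s[(SA[i] - 1) mod |s|]."""
    n = len(s)
    return "".join(s[(pos - 1) % n] for pos in sa)


# Hypothetical input; '$' is assumed to be a unique sentinel that sorts
# before every other letter of the sequence.
text = "banana$"
sa = suffix_array(text)
bwt = bwt_from_suffix_array(text, sa)
print(sa)   # [6, 5, 3, 1, 0, 4, 2]
print(bwt)  # annb$aa
```

For sequences of genomic size, the naive sort would be replaced by a dedicated suffix array construction algorithm; the indexing relation itself stays the same.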

Acknowledgments

This work was supported in part by a grant from the Agence Nationale de la Recherche [project Modulome ANR-05-MMSA-0010-01].

Author information

Authors and Affiliations

IRISA, INRIA centre de recherche Rennes-Bretagne Atlantique, Campus Universitaire de Beaulieu, Rennes Cedex, France
Jacques Nicolas

Corresponding author

Correspondence to Jacques Nicolas.

Editor information

Editors and Affiliations

Physiologie de la Reproduction, UMR INRA-CNRS 6175, Nouzilly Cedex, 37380, France
Yves Bigot

About this protocol

Cite this protocol

Nicolas, J. (2012). To Detect and Analyze Sequence Repeats Whatever Be Their Origin. In: Bigot, Y. (eds) Mobile Genetic Elements. Methods in Molecular Biology, vol 859. Humana Press. https://doi.org/10.1007/978-1-61779-603-6_4

DOI: https://doi.org/10.1007/978-1-61779-603-6_4
Published: 31 January 2012
Publisher Name: Humana Press
Print ISBN: 978-1-61779-602-9
Online ISBN: 978-1-61779-603-6
eBook Packages: Springer Protocols