Lossless Filter for Finding Long Multiple Approximate Repetitions Using a New Data Structure, the Bi-factor Array

Peterlongo, Pierre; Pisanti, Nadia; Boyer, Frederic; Sagot, Marie-France

doi:10.1007/11575832_20

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3772)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Similarity search in texts, notably biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been developed to speed up the solution of this problem. However, previous filters were designed to speed up pattern matching, or to find repetitions between two sequences or repetitions occurring twice in the same sequence. In this paper, we present an algorithm called NIMBUS for filtering sequences prior to finding repetitions occurring more than twice in a sequence or in more than two sequences. NIMBUS uses gapped seeds that are indexed with a new data structure, called a bi-factor array, that is also presented in this paper. Experimental results show that the filter can be very efficient: when a data set in which one wants to find functional elements with a multiple local alignment tool such as GLAM ([7]) is first preprocessed with NIMBUS, the overall execution time can be reduced from 10 hours to 6 minutes while obtaining exactly the same results.
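
The abstract does not define the bi-factor array, so the following is only an illustrative sketch of the general idea of indexing gapped seeds. It assumes that a bi-factor is a pair of length-k substrings separated by a gap of g positions, and it builds the index by naive sorting. The function names (`bifactor`, `naive_bifactor_array`, `repeated_bifactors`) and the parameters `k`, `g`, and `min_occ` are hypothetical; this is neither the construction nor the filtering condition used by NIMBUS.

```python
from collections import defaultdict


def bifactor(text, i, k, g):
    """Concatenate the length-k factor at i with the one starting at i + k + g.

    Assumption: this is what a "bi-factor" looks like; the abstract
    itself gives no formal definition.
    """
    j = i + k + g
    return text[i:i + k] + text[j:j + k]


def naive_bifactor_array(text, k, g):
    """Start positions of all complete bi-factors, sorted by bi-factor, then position.

    Built by plain comparison sorting, purely for illustration.
    """
    last = len(text) - (2 * k + g)  # last start position with a full bi-factor
    return sorted(range(last + 1), key=lambda i: (bifactor(text, i, k, g), i))


def repeated_bifactors(text, k, g, min_occ=3):
    """Group positions by bi-factor and keep those seen at least min_occ times."""
    groups = defaultdict(list)
    for i in naive_bifactor_array(text, k, g):
        groups[bifactor(text, i, k, g)].append(i)
    return {bf: pos for bf, pos in groups.items() if len(pos) >= min_occ}


if __name__ == "__main__":
    text = "ACGTAACTTAACATA"
    # "AC?TA" occurs at positions 0, 5 and 10 with a different middle letter each time.
    print(repeated_bifactors(text, k=2, g=1))  # {'ACTA': [0, 5, 10]}
```

In this toy input the three occurrences differ in the gap position, so a contiguous 5-letter seed would miss them while the gapped seed groups them together. That is the kind of multiple approximate repetition a gapped-seed index is meant to expose.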

References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: A basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
[2] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
[3] Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.-P., Rivals, E., Vingron, M.: q-gram based database searching using a suffix array (QUASAR). In: Proceedings of 3rd RECOMB, pp. 77–83 (1999)
[4] Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, p. 73. Springer, Heidelberg (2001)
[5] Carvalho, A.M., Freitas, A.T., Oliveira, A.L., Sagot, M.-F.: A highly scalable algorithm for the extraction of cis-regulatory regions. Advances in Bioinformatics and Computational Biology 1, 273–282 (2005)
[6] Tettelin, H., et al.: Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science 287(5459), 1809–1815 (2000)
[7] Frith, M.C., Hansen, U., Spouge, J.L., Weng, Z.: Finding functional sequence elements by multiple local alignment. Nucleic Acids Res. 32 (2004)
[8] Iliopoulos, C.S., McHugh, J., Peterlongo, P., Pisanti, N., Rytter, W., Sagot, M.: A first approach to finding common motifs with gaps. International Journal of Foundations of Computer Science (2004)
[9] Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. Assoc. Comput. Mach. (to appear)
[10] Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
[11] Kim, D.K., Sim, J.S., Park, H., Park, K.: Linear-time construction of suffix arrays. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 186–199. Springer, Heidelberg (2003)
[12] Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms (to appear)
[13] Kolpakov, R., Bana, G., Kucherov, G.: mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 31(13), 3672–3678 (2003)
[14] Kucherov, G., Noé, L., Roytberg, M.: Multi-seed lossless filtration. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 297–310. Springer, Heidelberg (2004)
[15] Li, M., Ma, B., Kisman, D., Tromp, J.: PatternHunter II: Highly sensitive and fast homology search. J. of Comput. Biol. (2004)
[16] Lipman, D.J., Pearson, W.R.: Rapid and sensitive protein similarity searches. Science 227, 1435–1441 (1985)
[17] Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18(3), 440–445 (2002)
[18] Marsan, L., Sagot, M.-F.: Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification. J. of Comput. Biol. 7, 345–360 (2000)
[19] Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 350–363. Springer, Heidelberg (2000)
[20] Ovcharenko, I., Loots, G.G., Giardine, B.M., Hou, M., Ma, J., Hardison, R.C., Stubbs, L., Miller, W.: Mulan: Multiple-sequence local alignment and visualization for studying function and evolution. Genome Research 15, 184–194 (2005)
[21] Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient q-gram filters for finding all ε-matches over a given length. In: Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (2005)

Author information

Authors and Affiliations

Institut Gaspard-Monge, Université de Marne-la-Vallée, France
Pierre Peterlongo
Dipartimento di Informatica, Università di Pisa, Italy
Nadia Pisanti
LIPN, Université Paris-Nord, France
Nadia Pisanti
INRIA Rhône-Alpes and LBBE, Univ. Claude Bernard, Lyon, France
Frederic Boyer & Marie-France Sagot
King’s College, London, UK
Marie-France Sagot

Editor information

Editors and Affiliations

University of Toronto
Mariano Consens
Dept. of Computer Science, University of Chile
Gonzalo Navarro

About this paper

Cite this paper

Peterlongo, P., Pisanti, N., Boyer, F., Sagot, MF. (2005). Lossless Filter for Finding Long Multiple Approximate Repetitions Using a New Data Structure, the Bi-factor Array. In: Consens, M., Navarro, G. (eds) String Processing and Information Retrieval. SPIRE 2005. Lecture Notes in Computer Science, vol 3772. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575832_20

DOI: https://doi.org/10.1007/11575832_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29740-6
Online ISBN: 978-3-540-32241-2
eBook Packages: Computer Science (R0)
