The VLDB Journal

Volume 21, Issue 6, pp 779–795

Approximate regional sequence matching for genomic databases

  • Thanasis Vergoulis
  • Theodore Dalamagas
  • Dimitris Sacharidis
  • Timos Sellis
Regular Paper

Abstract

Recent advances in computational biology have raised sequence matching requirements that result in new types of sequence database problems. In this work, we introduce an important class of such problems, the approximate regional sequence matching (ARSM) problem. Given a data sequence and a pattern sequence, an ARSM result is an approximate occurrence of a region of the pattern in the data sequence, subject to two conditions. First, the region must contain a predetermined area of the pattern sequence, termed the core. Second, the allowable deviation between the region of the pattern and its occurrence in the data sequence depends on the length of the region. We propose the PS-ARSM method, which holistically processes the regions of a pattern, taking advantage of their overlaps to identify the ARSM results efficiently. Its performance is evaluated against existing techniques adapted to the ARSM problem.
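
To make the problem statement concrete, the following is a minimal brute-force sketch of the ARSM definition; it is not the PS-ARSM method, which the abstract does not specify. Several details are assumptions made only for illustration: deviation is measured as unit-cost edit distance, the core is given as a half-open index range `[core_start, core_end)` of the pattern, and the length-dependent allowance is supplied by a hypothetical caller-provided function `max_errors`. Each region is scanned independently with a standard dynamic-programming pass for approximate substring matching, so the overlaps between regions are not exploited.

```python
def approx_end_positions(region, data, k):
    """Return (end, dist) pairs such that some substring of `data` ending at
    index `end` (exclusive) is within edit distance `k` of `region`."""
    m = len(region)
    prev = list(range(m + 1))  # column for the empty data prefix
    hits = []
    for j, ch in enumerate(data, start=1):
        cur = [0] * (m + 1)  # row 0 stays 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if region[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra symbol in data
                         cur[i - 1] + 1)      # symbol of region missing in data
        if cur[m] <= k:
            hits.append((j, cur[m]))
        prev = cur
    return hits


def naive_arsm(pattern, data, core_start, core_end, max_errors):
    """Enumerate every region pattern[s:e] that contains the core and report
    its approximate occurrences as (s, e, end_in_data, distance) tuples.

    `max_errors(length)` is an assumed, caller-supplied threshold function.
    """
    results = []
    for s in range(0, core_start + 1):
        for e in range(core_end, len(pattern) + 1):
            region = pattern[s:e]
            k = max_errors(len(region))
            for end, dist in approx_end_positions(region, data, k):
                results.append((s, e, end, dist))
    return results


if __name__ == "__main__":
    # Hypothetical allowance: one error for regions of length 4 or more.
    allowance = lambda n: 1 if n >= 4 else 0
    for hit in naive_arsm("ACGT", "ACTT", 1, 3, allowance):
        print(hit)
```

Because the number of regions grows quadratically with the amount of pattern on either side of the core, and each region triggers a full scan of the data sequence, this baseline mainly illustrates why processing overlapping regions jointly is attractive.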

Keywords

Sequence matching · Genomic databases

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Thanasis Vergoulis (1)
  • Theodore Dalamagas (2)
  • Dimitris Sacharidis (2)
  • Timos Sellis (1)

  1. NTUA & IMIS, Athena RC, Athens, Greece
  2. IMIS, Athena RC, Athens, Greece