The VLDB Journal, Volume 21, Issue 6, pp 779–795

Approximate regional sequence matching for genomic databases

  • Thanasis Vergoulis
  • Theodore Dalamagas
  • Dimitris Sacharidis
  • Timos Sellis
Regular Paper

DOI: 10.1007/s00778-012-0270-1

Cite this article as:
Vergoulis, T., Dalamagas, T., Sacharidis, D. et al. The VLDB Journal (2012) 21: 779. doi:10.1007/s00778-012-0270-1

Abstract

Recent advances in computational biology have raised sequence matching requirements that result in new types of sequence database problems. In this work, we introduce an important class of such problems, the approximate regional sequence matching (ARSM) problem. Given a data sequence and a pattern sequence, an ARSM result is an approximate occurrence of a region of the pattern in the data sequence under two conditions. First, the region must contain a predetermined area of the pattern sequence, termed the core. Second, the allowable deviation between the region of the pattern and its occurrence in the data sequence depends on the length of the region. We propose the PS-ARSM method, which holistically processes the regions of a pattern, taking advantage of their overlaps to efficiently identify the ARSM results. We evaluate its performance against existing techniques adapted to the ARSM problem.
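To make the problem statement concrete, the following is a minimal brute-force sketch of the two ARSM conditions. It is not the PS-ARSM method, which the abstract describes only at a high level. Several details are assumptions made purely for illustration: deviation is measured with Hamming distance (the abstract does not name a distance measure), the core is given as a half-open index range of the pattern, and the length-dependent error budget is supplied by a hypothetical `max_errors` function. The name `naive_arsm` is likewise invented here.

```python
from typing import Callable, List, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_arsm(
    data: str,
    pattern: str,
    core_start: int,
    core_end: int,
    max_errors: Callable[[int], int],
) -> List[Tuple[int, int, int, int]]:
    """Enumerate ARSM results by brute force (illustrative only).

    The core is pattern[core_start:core_end]. Every region
    pattern[rs:re] with rs <= core_start and re >= core_end contains
    the core (condition 1). A region matches at data position pos when
    its Hamming distance to the equal-length data window is at most
    max_errors(len(region)) (condition 2, with an assumed threshold).

    Returns tuples (rs, re, pos, errors).
    """
    results = []
    for rs in range(core_start + 1):
        for re_ in range(core_end, len(pattern) + 1):
            region = pattern[rs:re_]
            budget = max_errors(len(region))
            for pos in range(len(data) - len(region) + 1):
                errors = hamming(region, data[pos:pos + len(region)])
                if errors <= budget:
                    results.append((rs, re_, pos, errors))
    return results


if __name__ == "__main__":
    # Hypothetical threshold: one mismatch allowed per four symbols.
    hits = naive_arsm("ACGTACGA", "CGTT", 1, 3, lambda length: length // 4)
    for rs, re_, pos, errors in hits:
        print(f"region [{rs}:{re_}] at data position {pos}, errors={errors}")
```

This sketch examines every region separately, so overlapping regions repeat the same comparisons. According to the abstract, that redundancy is what PS-ARSM avoids by processing the regions of a pattern holistically.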

Keywords

  • Sequence matching
  • Genomic databases

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Thanasis Vergoulis
    • 1
  • Theodore Dalamagas
    • 2
  • Dimitris Sacharidis
    • 2
  • Timos Sellis
    • 1
  1. NTUA & IMIS, Athena RC, Athens, Greece
  2. IMIS, Athena RC, Athens, Greece