Approximate regional sequence matching for genomic databases
- First Online:
- Cite this article as:
- Vergoulis, T., Dalamagas, T., Sacharidis, D. et al. The VLDB Journal (2012) 21: 779. doi:10.1007/s00778-012-0270-1
Recent advances in computational biology have raised sequence matching requirements that result in new types of sequence database problems. In this work, we introduce an important class of such problems, the approximate regional sequence matching (ARSM) problem. Given a data and a pattern sequence, an ARSM result is an approximate occurrence of a region of the pattern in the data sequence under two conditions. First, the region must contain a predetermined area of the pattern sequence, termed core. Second, the allowable deviation between the region of the pattern and its occurrence in the data sequence depends on the length of the region. We propose the PS-ARSM method that processes holistically the regions of a pattern, taking advantage of their overlaps to efficiently identify the ARSM results. Its performance is evaluated with respect to existing techniques adapted to the ARSM problem.