Gapped Local Similarity Search with Provable Guarantees

Abstract

We present qhash, a program based on q-gram filtration and high-dimensional search that finds gapped local similarities between two sequences. Our approach differs from past q-gram-based approaches in two main respects. First, our filtration step uses algorithms for a sparse all-pairs problem, whereas past studies use suffix-tree-like structures and counters. Second, our program works in sequence-sequence mode, whereas most earlier programs (except QUASAR) work in pattern-database mode.

We leverage existing research in high-dimensional proximity search to discuss sparse all-pairs algorithms, and show them to be subquadratic under certain reasonable input assumptions. Our qhash program has provable sensitivity (even on worst-case inputs) and average-case performance guarantees. It is significantly faster than a fully sensitive dynamic-programming-based program for strong similarity search on long sequences.
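
For readers unfamiliar with q-gram filtration, the following minimal sketch illustrates the general idea in sequence-sequence mode. It is not the qhash algorithm: it uses the classical counter-based filter (shared q-grams counted per diagonal band), which is the style of approach the abstract contrasts qhash with. The function names and the parameters `q`, `band`, and `min_hits` are assumptions chosen for illustration only.

```python
from collections import defaultdict


def qgram_index(seq, q):
    """Map each q-gram of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - q + 1):
        index[seq[i:i + q]].append(i)
    return index


def candidate_diagonals(a, b, q=3, band=4, min_hits=3):
    """Return {band_id: hit_count} for diagonal bands sharing >= min_hits q-grams.

    A q-gram at position i in `a` and position j in `b` lies on diagonal i - j.
    Diagonals are grouped into bands of width `band` so that a few gaps do not
    scatter the hits of one similarity across many counters (illustrative choice).
    """
    index = qgram_index(a, q)
    hits = defaultdict(int)
    for j in range(len(b) - q + 1):
        # .get avoids inserting empty entries into the defaultdict
        for i in index.get(b[j:j + q], ()):
            hits[(i - j) // band] += 1
    return {d: c for d, c in hits.items() if c >= min_hits}
```

Bands that pass the threshold would then be handed to a more expensive verification step, such as banded dynamic programming. A filter of this kind only discards regions; whether it can miss a true similarity depends on how `q`, `band`, and `min_hits` relate to the similarity definition, which is the question the sensitivity guarantee in the abstract addresses.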