Efficient Approximate Subsequence Matching Using Hybrid Signatures
In this paper, we study approximate subsequence matching, known in genomics as the read mapping problem: given a query (a DNA read) and a reference genome, find the subsequences of the reference that are similar to the query under a user-specified similarity threshold k. Here a subsequence means a substring, i.e., a run of consecutive characters. Existing methods first extract subsequences from the query to generate signatures, then use those signatures to produce candidate positions, and finally verify the candidates to obtain the true mapping positions. These methods have two main drawbacks: (1) they produce many candidate positions, and (2) they generate a large number of signatures, many of which are redundant. To address both issues, we propose a novel filtering technique, called hybrid signatures, that achieves a better balance between the filtering power of the signatures and the overhead of producing candidate positions. We also devise an adaptive algorithm that produces candidate positions using hybrid signatures. Experimental results on real-world genomic sequences show that our method outperforms state-of-the-art methods in query efficiency.
Keywords: Read mapping; Approximate subsequence matching; Hybrid signatures
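The generic filter-and-verify pipeline that the abstract attributes to existing methods (generate signatures, produce candidate positions, verify) can be sketched roughly as follows. This is not the hybrid-signature method proposed in the paper. It is a minimal baseline that assumes the similarity threshold k is an edit-distance bound and uses the classic pigeonhole partition into k + 1 exact seeds as signatures. All function names (`partition_seeds`, `find_all`, `best_window_distance`, `map_read`) are hypothetical, and a real mapper would look seeds up in a prebuilt index rather than scanning the reference.

```python
def partition_seeds(query, k):
    """Split the query into k + 1 non-overlapping seeds (pigeonhole signatures).

    Assumption: with at most k edits, at least one of the k + 1 seeds
    must occur in the reference without any edit.
    """
    n_seeds = k + 1
    if len(query) < n_seeds:
        raise ValueError("query must have at least k + 1 characters")
    base, extra = divmod(len(query), n_seeds)
    seeds, start = [], 0
    for i in range(n_seeds):
        length = base + (1 if i < extra else 0)
        seeds.append((start, query[start:start + length]))
        start += length
    return seeds


def find_all(text, pattern):
    """Yield every start index of pattern in text (overlapping hits included)."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


def best_window_distance(query, window):
    """Minimum edit distance between the query and any substring of window."""
    prev = [0] * (len(window) + 1)  # row 0: a match may start anywhere for free
    for i, qc in enumerate(query, 1):
        cur = [i]  # column 0: i query characters matched against nothing
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j] + 1,                # skip a query character
                           cur[j - 1] + 1,             # skip a window character
                           prev[j - 1] + (qc != wc)))  # match or substitute
        prev = cur
    return min(prev)  # a match may end anywhere for free


def map_read(reference, query, k):
    """Return sorted (candidate_position, distance) pairs with distance <= k."""
    # Filtering: each exact seed hit implies a candidate start of the query.
    candidates = set()
    for offset, seed in partition_seeds(query, k):
        for hit in find_all(reference, seed):
            candidates.add(hit - offset)

    # Verification: check a window around each candidate, padded by k
    # on both sides to absorb insertions and deletions.
    results = []
    for cand in sorted(candidates):
        lo = max(0, cand - k)
        hi = min(len(reference), cand + len(query) + k)
        dist = best_window_distance(query, reference[lo:hi])
        if dist <= k:
            results.append((cand, dist))
    return results


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCAGTAGGCATCGA"
    print(map_read(reference, "GACCTGTAGG", 1))
```

The sketch makes the trade-off named in the abstract visible: shorter seeds hit the reference more often and inflate the candidate set, so verification dominates the cost, while the choice and number of seeds determines how much work the filtering step itself does.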