Data Mining and Knowledge Discovery, Volume 29, Issue 5, pp 1280–1311

DRESS: dimensionality reduction for efficient sequence search

  • Alexios Kotsifakos
  • Alexandra Stefan
  • Vassilis Athitsos
  • Gautam Das
  • Panagiotis Papapetrou


Similarity search in large sequence databases is a problem that arises in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the \(t\) most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantages of our method, compared to alternative methods for whole sequence matching under the edit distance, are that it does not require any training to create the mapping and that it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.
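The filter-and-refine idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how codewords are defined or how the range threshold is adapted in the reduced space, so the sketch assumes fixed-length codewords of length `k`, a greedy left-to-right replacement scan, and a caller-supplied `reduced_radius` (defaulting to the original radius). All function and parameter names are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def top_codewords(query, k, t):
    """ASSUMPTION: a codeword is any length-k substring of the query.
    Returns the t most frequent ones (overlapping occurrences counted)."""
    counts = Counter(query[i:i + k] for i in range(len(query) - k + 1))
    return [w for w, _ in counts.most_common(t)]


def reduce_string(s, codewords, k):
    """Emit one new letter per codeword occurrence and drop everything else.
    ASSUMPTION: greedy, non-overlapping, left-to-right scan."""
    symbol = {w: chr(ord("A") + i) for i, w in enumerate(codewords)}
    out, i = [], 0
    while i <= len(s) - k:
        w = s[i:i + k]
        if w in symbol:
            out.append(symbol[w])
            i += k
        else:
            i += 1
    return "".join(out)


def range_query(query, database, radius, k=2, t=3, reduced_radius=None):
    """Filter in the reduced space, then refine in the original space."""
    codewords = top_codewords(query, k, t)
    q_red = reduce_string(query, codewords, k)
    if reduced_radius is None:
        reduced_radius = radius  # ASSUMPTION: reuse the original radius
    # Filter step: cheap comparison on the shorter, reduced strings.
    candidates = [s for s in database
                  if edit_distance(q_red, reduce_string(s, codewords, k))
                  <= reduced_radius]
    # Refine step: exact edit distance on the original strings.
    return [s for s in candidates if edit_distance(query, s) <= radius]
```

Because the mapping discards information, the filter step in this sketch can miss true matches, which is consistent with the abstract's description of the method as approximate. In practice the reduced database strings would be computed per query, since the codewords depend on the query.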


Keywords: Similarity search, Alphabet reduction, Biological sequences



The work of Vassilis Athitsos was partially supported by National Science Foundation grants IIS-1055062, CNS-1059235, CNS-1035913, and CNS-1338118. The work of Gautam Das was partially supported by National Science Foundation grants 0812601, 0915834, and 1018865, and by grants from Microsoft Research.

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© The Author(s) 2015

Authors and Affiliations

  • Alexios Kotsifakos (1)
  • Alexandra Stefan (1)
  • Vassilis Athitsos (1)
  • Gautam Das (1)
  • Panagiotis Papapetrou (2)

  1. Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, USA
  2. Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden
