Abstract
As a first step in designing relatively-compressed data structures—i.e., such that storing an instance for one dataset helps us store instances for similar datasets—we consider how to compress spaced suffix arrays relative to normal suffix arrays and still support fast access to them. This problem is of practical interest when performing similarity search with spaced seeds because using several seeds in parallel significantly improves their performance, but with existing approaches we keep a separate linear-space hash table or spaced suffix array for each seed. We first prove a theoretical upper bound on the space needed to store a spaced suffix array when we already have the suffix array. We then present experiments indicating that our approach works even better in practice.
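To make the two objects named in the abstract concrete, the following minimal sketch builds a normal suffix array and a spaced suffix array for a small text. It is only an illustration under stated assumptions, not the compressed representation the paper develops: the function names, the encoding of a seed as a string of `1` (compare this position) and `0` (ignore it), the naive sorting-based construction, and the rule of breaking ties by text position are all choices made here for illustration.

```python
def suffix_array(text):
    """Text positions sorted by the suffix starting at each position (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def spaced_suffix_array(text, seed):
    """Text positions sorted by the characters under the seed's '1' positions.

    Assumptions of this sketch: the seed is a string over {'0', '1'},
    positions that run past the end of the text are simply skipped, and
    ties are broken by text position.
    """
    care = [k for k, bit in enumerate(seed) if bit == "1"]

    def masked(i):
        # Keep only the characters the seed "cares" about, starting at i.
        return "".join(text[i + k] for k in care if i + k < len(text))

    return sorted(range(len(text)), key=lambda i: (masked(i), i))


if __name__ == "__main__":
    text = "abracadabra"
    print(suffix_array(text))
    print(spaced_suffix_array(text, "101"))
```

Both arrays are permutations of the same text positions. The abstract's goal is to store the second one compactly given that the first is already available, and to still access it quickly.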
Cite this article
Gagie, T., Manzini, G. & Valenzuela, D. Compressed Spaced Suffix Arrays. Math. Comput. Sci. 11, 151–157 (2017). https://doi.org/10.1007/s11786-016-0283-z
DOI: https://doi.org/10.1007/s11786-016-0283-z