Compressed Spaced Suffix Arrays

Gagie, Travis; Manzini, Giovanni; Valenzuela, Daniel

doi:10.1007/s11786-016-0283-z

Published: 02 February 2017

Volume 11, pages 151–157, (2017)

Mathematics in Computer Science

Travis Gagie¹,
Giovanni Manzini²,³ &
Daniel Valenzuela⁴

Abstract

As a first step in designing relatively-compressed data structures—i.e., such that storing an instance for one dataset helps us store instances for similar datasets—we consider how to compress spaced suffix arrays relative to normal suffix arrays and still support fast access to them. This problem is of practical interest when performing similarity search with spaced seeds because using several seeds in parallel significantly improves their performance, but with existing approaches we keep a separate linear-space hash table or spaced suffix array for each seed. We first prove a theoretical upper bound on the space needed to store a spaced suffix array when we already have the suffix array. We then present experiments indicating that our approach works even better in practice.
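To make the two objects in the abstract concrete, the following is a minimal sketch, not the authors' implementation, that builds a plain suffix array and a spaced suffix array for a small text by naive sorting. It assumes a seed is written as a string of 1s (positions that are compared) and 0s (positions that are ignored), that ties are broken by text position, and that windows running past the end of the text are truncated. The function names and the example text are hypothetical.

```python
def suffix_array(text):
    """Start positions of all suffixes, sorted by the suffix starting there."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def spaced_suffix_array(text, seed):
    """Start positions sorted by the characters under the seed's '1' positions.

    Assumptions (not taken from the article): ties are broken by text
    position, and care positions beyond the end of the text are dropped.
    """
    care = [k for k, bit in enumerate(seed) if bit == "1"]

    def masked(i):
        # Characters the seed "sees" when it is aligned at position i.
        return tuple(text[i + k] for k in care if i + k < len(text))

    return sorted(range(len(text)), key=lambda i: (masked(i), i))


text = "abracadabra"
sa = suffix_array(text)
ssa = spaced_suffix_array(text, "101")
print(sa)   # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(ssa)  # [10, 3, 5, 0, 7, 1, 8, 4, 6, 9, 2]
```

Both arrays are permutations of the same text positions, so a spaced suffix array can in principle be described by how it reorders the suffix array. Stored naively, each additional seed costs another array of the same length as the text.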

Author information

Authors and Affiliations

Diego Portales University and CEBIB, Santiago, Chile
Travis Gagie
University of Eastern Piedmont, Alessandria, Italy
Giovanni Manzini
IIT-CNR, Pisa, Italy
Giovanni Manzini
University of Helsinki, Helsinki, Finland
Daniel Valenzuela

Corresponding author

Correspondence to Travis Gagie.

About this article

Cite this article

Gagie, T., Manzini, G. & Valenzuela, D. Compressed Spaced Suffix Arrays. Math. Comput. Sci. 11, 151–157 (2017). https://doi.org/10.1007/s11786-016-0283-z

Received: 22 April 2014
Revised: 13 August 2015
Accepted: 02 March 2016
Published: 02 February 2017
Issue Date: June 2017
DOI: https://doi.org/10.1007/s11786-016-0283-z

Keywords

Mathematics Subject Classification

68P05
