Sampling the Suffix Array with Minimizers

Grabowski, Szymon; Raniszewski, Marcin

doi:10.1007/978-3-319-23826-5_28

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9309)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Sampling (evenly) the suffixes from the suffix array is an old idea that trades pattern search time for reduced index space. A few years ago, Claude et al. showed an alphabet sampling scheme that allows more efficient pattern searches than the sparse suffix array, for long enough patterns. A drawback of their approach is the requirement that sought patterns contain at least one character from the chosen subalphabet. In this work we propose an alternative suffix sampling approach with only a minimum pattern length as a requirement, which seems more convenient in practice. Experiments show that our algorithm achieves competitive time-space tradeoffs on most standard benchmark data. As a side result, we show that \(n'\) arbitrarily selected suffixes from a text of length \(n\), where \(n' < n\), over an integer alphabet, can be sorted in \(O(n)\) time using \(O(n')\) words of space.
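
The abstract does not spell out the sampling rule, so the following is only a minimal sketch of the general minimizer idea (in the sense of Roberts et al.), not the authors' implementation. It assumes that the sampled suffixes are those starting at the leftmost lexicographically smallest p-mer of every length-w window, and that a pattern must have length at least w. The function names and the parameters `w` and `p` are hypothetical, and the naive sorting and scanning used here are chosen for clarity rather than efficiency.

```python
def minimizer_positions(text: str, w: int, p: int) -> list[int]:
    """Start positions of the leftmost smallest p-mer in each length-w window."""
    if not 1 <= p <= w <= len(text):
        raise ValueError("need 1 <= p <= w <= len(text)")
    chosen = set()
    for i in range(len(text) - w + 1):
        # Candidate p-mers of window text[i:i+w] start at i .. i+w-p.
        chosen.add(min(range(i, i + w - p + 1),
                       key=lambda j: (text[j:j + p], j)))
    return sorted(chosen)


def sampled_suffix_array(text: str, w: int, p: int) -> list[int]:
    """Sampled positions sorted by their suffixes (naive sort, for clarity)."""
    return sorted(minimizer_positions(text, w, p), key=lambda j: text[j:])


def _bound(text: str, ssa: list[int], q: str, upper: bool) -> int:
    """Binary search over sampled suffixes truncated to len(q)."""
    lo, hi = 0, len(ssa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[ssa[mid]:ssa[mid] + len(q)]
        if prefix < q or (upper and prefix == q):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find(text: str, ssa: list[int], w: int, p: int, pattern: str) -> list[int]:
    """All occurrences of pattern in text; requires len(pattern) >= w."""
    if len(pattern) < w:
        raise ValueError("pattern shorter than the window length w")
    # The minimizer of the pattern's first window is sampled at every occurrence,
    # because the text window at that occurrence equals the pattern's window.
    k = min(range(w - p + 1), key=lambda j: (pattern[j:j + p], j))
    q = pattern[k:]
    lo = _bound(text, ssa, q, upper=False)
    hi = _bound(text, ssa, q, upper=True)
    hits = []
    for pos in ssa[lo:hi]:
        start = pos - k
        # Verify the pattern prefix that precedes the minimizer.
        if start >= 0 and text[start:pos] == pattern[:k]:
            hits.append(start)
    return sorted(hits)
```

For example, with `text = "abracadabra"`, `w = 4` and `p = 2`, only four of the eleven suffixes are kept, and `find(text, sampled_suffix_array(text, 4, 2), 4, 2, "abra")` reports both occurrences of the pattern. The only restriction on queries in this sketch is the minimum length `w`, which mirrors the requirement stated in the abstract.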

References

Alstrup, S., Brodal, G.S., Rauhe, T.: Pattern matching in dynamic texts. In: SODA, pp. 819–828. Society for Industrial and Applied Mathematics (2000)
Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. Journal of Computational Biology 22(5), 336–352 (2015)
Claude, F., Navarro, G., Peltola, H., Salmela, L., Tarhio, J.: String matching with alphabet sampling. Journal of Discrete Algorithms 11, 37–50 (2012)
Crescenzi, P., Del Lungo, A., Grossi, R., Lodi, E., Pagli, L., Rossi, G.: Text sparsification via local maxima. In: Kapoor, S., Prasad, S. (eds.) FST TCS 2000. LNCS, vol. 1974, pp. 290–301. Springer, Heidelberg (2000)
Crescenzi, P., Del Lungo, A., Grossi, R., Lodi, E., Pagli, L., Rossi, G.: Text sparsification via local maxima. Theoretical Computer Science 304(1–3), 341–364 (2003)
Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015)
Ferragina, P., Fischer, J.: Suffix arrays on words. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 328–339. Springer, Heidelberg (2007)
Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics 13, article 12, 30 (2009)
Fischer, J., Gagie, T., Gawrychowski, P., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. CoRR, abs/1504.06647 (2015)
Gog, S., Petri, M.: Optimized succinct data structures for massive data. Software-Practice and Experience 44(11), 1287–1314 (2014)
Grabowski, S., Deorowicz, S., Roguski, Ł.: Disk-based compression of data from genome sequencing. Bioinformatics 31(9), 1389–1395 (2015)
Grabowski, S., Raniszewski, M.: Two simple full-text indexes based on the suffix array. In: Holub, J., Zdárek, J. (eds.) PSC, pp. 179–191. Faculty of Information Technology, Czech Technical University in Prague, Department of Theoretical Computer Science (2014)
Grabowski, S., Raniszewski, M.: Two simple full-text indexes based on the suffix array (2015). Submitted to a journal
Han, Y.: Deterministic sorting in \(O(n \log \log n)\) time and linear space. In: STOC, pp. 602–608. ACM (2002)
Tomohiro, I., Kärkkäinen, J., Kempa, D.: Faster sparse suffix sorting. In: Mayr, E.W., Portier, N. (eds.) STACS. LIPIcs, vol. 25, pp. 386–396. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2014)
Kärkkäinen, J., Ukkonen, E.: Sparse suffix trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 1996. LNCS, vol. 1090, pp. 219–230. Springer, Heidelberg (1996)
Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. In: VLDB, pp. 169–180. VLDB Endowment (2013)
Mehlhorn, K., Sundar, R., Uhrig, C.: Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica 17(2), 183–198 (1997)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), article 2 (2007)
Puglisi, S.J., Smyth, W.F., Turpin, A.: Inverted files versus suffix arrays for locating patterns in primary memory. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 122–133. Springer, Heidelberg (2006)
Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)
Sahinalp, S.C., Vishkin, U.: Symmetry breaking for suffix tree construction. In: STOC, pp. 300–309. ACM (1994)
Wood, D.E., Salzberg, S.L.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15(3), R46 (2014)

Author information

Authors and Affiliations

Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90–924, Łódź, Poland
Szymon Grabowski & Marcin Raniszewski

Corresponding author

Correspondence to Szymon Grabowski.

Editor information

Editors and Affiliations

King's College London, London, United Kingdom
Costas Iliopoulos
University of Helsinki, Helsinki, Finland
Simon Puglisi
University College London, London, United Kingdom
Emine Yilmaz

Cite this paper

Grabowski, S., Raniszewski, M. (2015). Sampling the Suffix Array with Minimizers. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_28

DOI: https://doi.org/10.1007/978-3-319-23826-5_28
Published: 05 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23825-8
Online ISBN: 978-3-319-23826-5
eBook Packages: Computer Science, Computer Science (R0)