Sampling the Suffix Array with Minimizers

  • Conference paper
String Processing and Information Retrieval (SPIRE 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9309)

Included in the following conference series:

  • International Symposium on String Processing and Information Retrieval

Abstract

Sampling (evenly) the suffixes from the suffix array is an old idea trading the pattern search time for reduced index space. A few years ago Claude et al. showed an alphabet sampling scheme allowing for more efficient pattern searches compared to the sparse suffix array, for long enough patterns. A drawback of their approach is the requirement that sought patterns need to contain at least one character from the chosen subalphabet. In this work we propose an alternative suffix sampling approach with only a minimum pattern length as a requirement, which seems more convenient in practice. Experiments show that our algorithm achieves competitive time-space tradeoffs on most standard benchmark data. As a side result, we show that \(n'\) arbitrarily selected suffixes from a text of length \(n\), where \(n' < n\), over an integer alphabet, can be sorted in \(O(n)\) time using \(O(n')\) words of space.
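
The idea of sampling suffixes at minimizer positions can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the parameters `w` (window length, also the assumed minimum pattern length) and `p` (minimizer length), and the leftmost tie-breaking rule are all assumptions made for this example, and the index is built by naive sorting rather than by any linear-time method. The sketch samples the suffix starting at the smallest `p`-gram of every length-`w` text window, then answers a query by locating the minimizer of the pattern's first window, binary-searching the sampled suffixes for the pattern's tail, and verifying the characters that precede the minimizer.

```python
def leftmost_minimizer(s, start, w, p):
    """Position of the smallest p-gram inside s[start:start+w] (ties: leftmost)."""
    return min(range(start, start + w - p + 1), key=lambda i: (s[i:i + p], i))


def minimizer_positions(text, w, p):
    """Minimizer positions over all length-w windows of text (naive scan)."""
    if not 1 <= p <= w:
        raise ValueError("need 1 <= p <= w")
    return sorted({leftmost_minimizer(text, s, w, p)
                   for s in range(len(text) - w + 1)})


def build_sampled_sa(text, w, p):
    """Sampled suffix array: minimizer positions sorted by their suffixes.
    Naive sorting, for illustration only."""
    return sorted(minimizer_positions(text, w, p), key=lambda i: text[i:])


def find_all(text, sampled_sa, pattern, w, p):
    """All occurrences of pattern (length >= w) using the sampled index."""
    if len(pattern) < w:
        raise ValueError("pattern must have length at least w")
    # The minimizer of the pattern's first window is sampled in every occurrence.
    off = leftmost_minimizer(pattern, 0, w, p)
    tail = pattern[off:]
    m = len(tail)

    def bound(strict):
        # First index whose length-m suffix prefix is >= tail (or > tail if strict).
        lo, hi = 0, len(sampled_sa)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = text[sampled_sa[mid]:sampled_sa[mid] + m]
            if pref < tail or (strict and pref == tail):
                lo = mid + 1
            else:
                hi = mid
        return lo

    hits = []
    for i in range(bound(False), bound(True)):
        begin = sampled_sa[i] - off
        # Verify the part of the pattern that precedes the minimizer.
        if begin >= 0 and text[begin:sampled_sa[i]] == pattern[:off]:
            hits.append(begin)
    return sorted(hits)


text = "abracadabra"
sa = build_sampled_sa(text, w=4, p=2)
print(len(sa), "of", len(text), "suffixes sampled")
print(find_all(text, sa, "cada", w=4, p=2))
```

In this toy run only 4 of the 11 suffixes are kept, and a pattern shorter than `w` is rejected, which mirrors the minimum pattern length requirement stated in the abstract.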

References

  1. Alstrup, S., Brodal, G.S., Rauhe, T.: Pattern matching in dynamic texts. In: SODA, pp. 819–828. Society for Industrial and Applied Mathematics (2000)

  2. Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. Journal of Computational Biology 22(5), 336–352 (2015)

  3. Claude, F., Navarro, G., Peltola, H., Salmela, L., Tarhio, J.: String matching with alphabet sampling. Journal of Discrete Algorithms 11, 37–50 (2012)

  4. Crescenzi, P., Del Lungo, A., Grossi, R., Lodi, E., Pagli, L., Rossi, G.: Text sparsification via local maxima. In: Kapoor, S., Prasad, S. (eds.) FST TCS 2000. LNCS, vol. 1974, pp. 290–301. Springer, Heidelberg (2000)

  5. Crescenzi, P., Del Lungo, A., Grossi, R., Lodi, E., Pagli, L., Rossi, G.: Text sparsification via local maxima. Theoretical Computer Science 304(1–3), 341–364 (2003)

  6. Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015)

  7. Ferragina, P., Fischer, J.: Suffix arrays on words. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 328–339. Springer, Heidelberg (2007)

  8. Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics 13, article 12, 30 (2009)

  9. Fischer, J., Gagie, T., Gawrychowski, P., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. CoRR, abs/1504.06647 (2015)

  10. Gog, S., Petri, M.: Optimized succinct data structures for massive data. Software: Practice and Experience 44(11), 1287–1314 (2014)

  11. Grabowski, S., Deorowicz, S., Roguski, Ł.: Disk-based compression of data from genome sequencing. Bioinformatics 31(9), 1389–1395 (2015)

  12. Grabowski, S., Raniszewski, M.: Two simple full-text indexes based on the suffix array. In: Holub, J., Zdárek, J. (eds.) PSC, pp. 179–191. Faculty of Information Technology, Czech Technical University in Prague, Department of Theoretical Computer Science (2014)

  13. Grabowski, S., Raniszewski, M.: Two simple full-text indexes based on the suffix array (2015). Submitted to a journal

  14. Han, Y.: Deterministic sorting in \(O(n \log \log n)\) time and linear space. In: STOC, pp. 602–608. ACM (2002)

  15. Tomohiro, I., Kärkkäinen, J., Kempa, D.: Faster sparse suffix sorting. In: Mayr, E.W., Portier, N. (eds.) STACS. LIPIcs, vol. 25, pp. 386–396. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2014)

  16. Kärkkäinen, J., Ukkonen, E.: Sparse suffix trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 1996. LNCS, vol. 1090, pp. 219–230. Springer, Heidelberg (1996)

  17. Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. In: VLDB, pp. 169–180. VLDB Endowment (2013)

  18. Mehlhorn, K., Sundar, R., Uhrig, C.: Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica 17(2), 183–198 (1997)

  19. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), article 2 (2007)

  20. Puglisi, S.J., Smyth, W.F., Turpin, A.: Inverted files versus suffix arrays for locating patterns in primary memory. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 122–133. Springer, Heidelberg (2006)

  21. Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)

  22. Sahinalp, S.C., Vishkin, U.: Symmetry breaking for suffix tree construction. In: STOC, pp. 300–309. ACM (1994)

  23. Wood, D.E., Salzberg, S.L.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15(3), R46 (2014)

Author information

Corresponding author

Correspondence to Szymon Grabowski.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Grabowski, S., Raniszewski, M. (2015). Sampling the Suffix Array with Minimizers. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_28

  • DOI: https://doi.org/10.1007/978-3-319-23826-5_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23825-8

  • Online ISBN: 978-3-319-23826-5

  • eBook Packages: Computer Science, Computer Science (R0)
