Relative Lempel-Ziv Compression of Suffix Arrays

Puglisi, Simon J.; Zhukova, Bella

doi:10.1007/978-3-030-59212-7_7


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12303)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval


Abstract

We show that a combination of differential encoding, random sampling, and relative Lempel-Ziv (RLZ) parsing is effective for compressing suffix arrays, while simultaneously allowing very fast decompression of arbitrary suffix array intervals, which facilitates pattern matching. The resulting text index, while somewhat larger (5-10x) than the recent r-index of Gagie, Navarro, and Prezza (Proc. SODA ’18), still provides significant compression and allows pattern location queries to be answered more than two orders of magnitude faster in practice.
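
As a rough illustration of how the three ingredients named in the abstract can fit together, the following Python sketch differentially encodes a suffix array, builds a reference from randomly sampled pieces of the encoded array, and greedily parses the encoded array into copies from that reference. It is a toy under stated assumptions, not the authors' implementation: the function names, the sample sizes, the phrase format, and the naive quadratic matching are all hypothetical choices made for readability.

```python
import random

def suffix_array(text):
    # Naive construction by sorting suffixes; adequate for a small example.
    return sorted(range(len(text)), key=lambda i: text[i:])

def differential(sa):
    # d[0] = sa[0], d[i] = sa[i] - sa[i-1] for i > 0.
    return [sa[0]] + [sa[i] - sa[i - 1] for i in range(1, len(sa))]

def sample_reference(d, num_samples, sample_len, seed=0):
    # Hypothetical sampling scheme: concatenate randomly chosen windows of d.
    rng = random.Random(seed)
    ref = []
    for _ in range(num_samples):
        start = rng.randrange(max(1, len(d) - sample_len + 1))
        ref.extend(d[start:start + sample_len])
    return ref

def rlz_parse(d, ref):
    # Greedy parse: at each position take the longest match found in ref,
    # or emit a literal when the next value does not occur in ref at all.
    phrases, i = [], 0
    while i < len(d):
        best_pos, best_len = -1, 0
        for j in range(len(ref)):
            k = 0
            while i + k < len(d) and j + k < len(ref) and d[i + k] == ref[j + k]:
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        if best_len == 0:
            phrases.append(("lit", d[i]))
            i += 1
        else:
            phrases.append(("copy", best_pos, best_len))
            i += best_len
    return phrases

def decode(phrases, ref):
    # Expand phrases back into differences, then prefix-sum to recover the SA.
    d = []
    for p in phrases:
        if p[0] == "lit":
            d.append(p[1])
        else:
            d.extend(ref[p[1]:p[1] + p[2]])
    sa, total = [], 0
    for x in d:
        total += x
        sa.append(total)
    return sa

text = "abracadabra" * 4
sa = suffix_array(text)
d = differential(sa)
ref = sample_reference(d, num_samples=4, sample_len=6)
phrases = rlz_parse(d, ref)
assert decode(phrases, ref) == sa
```

This sketch decodes the whole array at once. Decompressing an arbitrary interval, as the abstract describes, would need additional bookkeeping that is not shown here.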

This research is supported by the Academy of Finland through grant 319454.

Notes

1. The only implementation of cdawg works only for strings over {a,c,g,t}.
2. We also tried, unsuccessfully, to include the Locally Compressed Suffix Array (LCSA) of González, Navarro, and Ferrada [12], which is based on differential encoding of the SA and RePair grammar compression. After expending significant effort attempting to get their code to work, we discovered, in communication with the authors [4], that our failure was due to known bugs in the (dated) LCSA codebase.

References

1. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19929-0_3
2. Cáceres, M., Puglisi, S.J., Zhukova, B.: Fast indexes for gapped pattern matching. In: Chatzigeorgiou, A., Dondi, R., Herodotou, H., Kapoutsis, C., Manolopoulos, Y., Papadopoulos, G.A., Sikora, F. (eds.) SOFSEM 2020. LNCS, vol. 12011, pp. 493–504. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-38919-2_40
3. Deorowicz, S., Grabowski, S.: Robust relative compression of genomes with random access. Bioinformatics 27(21), 2979–2986 (2011)
4. Ferrada, H.: Personal Communication
5. Farruggia, A., Gagie, T., Navarro, G., Puglisi, S.J., Sirén, J.: Relative suffix trees. Comput. J. 61(5), 773–788 (2018)
6. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, Redondo Beach, California, USA, 12–14 November 2000, pp. 390–398. IEEE Computer Society (2000)
7. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
8. Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of SODA, pp. 1459–1477. ACM-SIAM (2018)
9. Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), 2:1–2:54 (2020)
10. Gagie, T., Puglisi, S.J., Valenzuela, D.: Analyzing relative Lempel-Ziv reference construction. In: Inenaga, S., Sadakane, K., Sakai, T. (eds.) SPIRE 2016. LNCS, vol. 9954, pp. 160–165. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46049-9_16
11. González, R., Navarro, G.: Compressed text indexes with fast locate. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 216–227. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73437-6_23
12. González, R., Navarro, G., Ferrada, H.: Locally compressed suffix arrays. ACM J. Exp. Algorithmics 19(1), article 1 (2014)
13. Hoobin, C., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endow. 5(3), 265–273 (2011)
14. Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 201–206. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16321-0_20
15. Larsson, N.J., Moffat, A.: Offline dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000)
16. Liao, K., Petri, M., Moffat, A., Wirth, A.: Effective construction of relative Lempel-Ziv dictionaries. In: Proceedings of 25th International Conference on the World Wide Web (WWW), pp. 807–816 (2016)
17. Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, Cambridge (2015)
18. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
19. Tong, J., Wirth, A., Zobel, J.: Compact auxiliary dictionaries for incremental compression of large repositories. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, 3–7 November 2014, pp. 1629–1638. ACM (2014)
20. Tong, J., Wirth, A., Zobel, J.: Principled dictionary pruning for low-memory corpus compression. In: The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2014, Gold Coast, QLD, Australia, 06–11 July 2014, pp. 283–292. ACM (2014)
21. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)

Acknowledgements

Our thanks go to Héctor Ferrada, Nicola Prezza, and Daniel Valenzuela for prompt responses to our queries.

Author information

Authors and Affiliations

Department of Computer Science, Helsinki Institute for Information Technology (HIIT), University of Helsinki, Helsinki, Finland
Simon J. Puglisi & Bella Zhukova

Corresponding author

Correspondence to Bella Zhukova.

Editor information

Editors and Affiliations

CISE Department, University of Florida, Gainesville, FL, USA
Christina Boucher
Department of Computer Science, University of Central Florida, Orlando, FL, USA
Sharma V. Thankachan

Cite this paper

Puglisi, S.J., Zhukova, B. (2020). Relative Lempel-Ziv Compression of Suffix Arrays. In: Boucher, C., Thankachan, S.V. (eds) String Processing and Information Retrieval. SPIRE 2020. Lecture Notes in Computer Science, vol 12303. Springer, Cham. https://doi.org/10.1007/978-3-030-59212-7_7

DOI: https://doi.org/10.1007/978-3-030-59212-7_7
Published: 17 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59211-0
Online ISBN: 978-3-030-59212-7
eBook Packages: Computer Science, Computer Science (R0)