Skip to main content

RLZAP: Relative Lempel-Ziv with Adaptive Pointers

  • Conference paper
  • First Online:
String Processing and Information Retrieval (SPIRE 2016)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9954))

Included in the following conference series:

Abstract

Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.’s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression because many of the differences between individuals’ genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.’s idea to handle well also short insertions, deletions and multi-character substitutions. We show experimentally that our generalization achieves better compression than Ferrada et al.’s implementation with comparable random-access times.

Supported by the Academy of Finland through grants 258308, 268324, 284598 and 285221 and by the Wellcome Trust grant 098051. Parts of this work were done during the second author’s visit to the University of Helsinki and during the third author’s visits to Illumina Cambridge Ltd. and the University of A Coruña, Spain.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23, 337–343 (1977)

    Article  MathSciNet  MATH  Google Scholar 

  2. Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 201–206. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  3. Ziv, J., Merhav, N.: A measure of relative entropy between individual sequences with application to universal classification. IEEE Trans. Inf. Theor. 39, 1270–1279 (1993)

    Article  MathSciNet  MATH  Google Scholar 

  4. Hoobin, C., Puglisi, S.J., Zobel, J.: Sample selection for dictionary-based corpus compression. In: Proceedings of SIGIR, pp. 1137–1138 (2011)

    Google Scholar 

  5. Hoobin, C., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB 5, 265–273 (2011)

    Article  Google Scholar 

  6. Deorowicz, S., Grabowski, S.: Robust relative compression of genomes with random access. Bioinformatics 27, 2979–2986 (2011)

    Article  Google Scholar 

  7. Ferrada, H., Gagie, T., Gog, S., Puglisi, S.J.: Relative Lempel-Ziv with constant-time random access. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 13–17. Springer, Heidelberg (2014)

    Google Scholar 

  8. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Hybrid compression of bitvectors for the FM-index. In: Proceedings of DCC, pp. 302–311 (2014)

    Google Scholar 

  9. Deorowicz, S., Danek, A., Niemiec, M.: GDC2: compression of large collections of genomes. Sci. Rep. 5, 1–12 (2015)

    Article  Google Scholar 

  10. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52, 552–581 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  11. Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994)

    Google Scholar 

  12. Belazzougui, D., Gagie, T., Gog, S., Manzini, G., Sirén, J.: Relative FM-indexes. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 52–64. Springer, Heidelberg (2014)

    Google Scholar 

  13. Boucher, C., Bowe, A., Gagie, T., Manzini, G., Sirén, J.: Relative select. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 149–155. Springer, Heidelberg (2015)

    Chapter  Google Scholar 

  14. Léonard, M., Mouchard, L., Salson, M.: On the number of elements to reorder when updating a suffix array. J. Discrete Algorithms 11, 87–99 (2012)

    Article  MathSciNet  MATH  Google Scholar 

  15. Gagie, T., Navarro, G., Puglisi, S.J., Sirén, J.: Relative compressed suffix trees. Technical report 1508.02550 (2015). arxiv.org

  16. Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings of ALENEX (2007)

    Google Scholar 

  17. Raman, R., Raman, V., Satti, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3, 43 (2007)

    Article  MathSciNet  Google Scholar 

  18. Farruggia, A., Ferragina, P., Venturini, R.: Bicriteria data compression. In: Proceedings of SODA, pp. 1582–1595 (2014)

    Google Scholar 

  19. Farruggia, A., Ferragina, P., Venturini, R.: Bicriteria data compression: efficient and usable. In: Schulz, A.S., Wagner, D. (eds.) ESA 2014. LNCS, vol. 8737, pp. 406–417. Springer, Heidelberg (2014)

    Google Scholar 

  20. Brudno, M., Malde, S., Poliakov, A., Do, C.B., Couronne, O., Dubchak, I., Batzoglou, S.: Glocal alignment: finding rearrangements during alignment. In: Proceedings of ISMB, pp. 54–62 (2003)

    Google Scholar 

  21. Kubincová, P.: Mapping between genomes. Bachelor thesis, Comenius University, Slovakia Supervised by Broňa Brejová (2014)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Travis Gagie .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Cox, A.J., Farruggia, A., Gagie, T., Puglisi, S.J., Sirén, J. (2016). RLZAP: Relative Lempel-Ziv with Adaptive Pointers. In: Inenaga, S., Sadakane, K., Sakai, T. (eds) String Processing and Information Retrieval. SPIRE 2016. Lecture Notes in Computer Science(), vol 9954. Springer, Cham. https://doi.org/10.1007/978-3-319-46049-9_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-46049-9_1

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46048-2

  • Online ISBN: 978-3-319-46049-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics