RLZAP: Relative Lempel-Ziv with Adaptive Pointers

  • Anthony J. Cox
  • Andrea Farruggia
  • Travis Gagie
  • Simon J. Puglisi
  • Jouni Sirén
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9954)

Abstract

Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. In Kuruppu et al.’s (SPIRE 2010) original implementation, a reference genome is selected and the other genomes are then greedily parsed into phrases that exactly match substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression, because many of the differences between individuals’ genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that additionally using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.’s idea to also handle short insertions, deletions and multi-character substitutions well. We show experimentally that our generalization achieves better compression than Ferrada et al.’s implementation, with comparable random-access times.
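
The earlier ideas that the abstract builds on can be illustrated with a small sketch. The code below is not the RLZAP method itself and is not taken from any of the cited implementations. It is a minimal illustration, assuming a naive quadratic longest-match search, of greedy parsing where each phrase ends with a mismatch character, followed by the conversion of absolute pointers into relative ones. The function names (rlz_parse, rlz_decode, relative_pointers) and the toy strings are hypothetical.

```python
def rlz_parse(reference, target):
    """Greedily parse target into (ref_pos, length, mismatch_char) phrases."""
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        # Naive longest-match search (an assumption made for brevity);
        # practical implementations would use an index over the reference.
        for p in range(len(reference)):
            l = 0
            # Stop one character early so a mismatch character always remains.
            while (p + l < len(reference) and i + l < len(target) - 1
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        # The character following the match is stored explicitly.
        phrases.append((best_pos, best_len, target[i + best_len]))
        i += best_len + 1
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the target from the reference and the phrase list."""
    out = []
    for pos, length, mismatch in phrases:
        out.append(reference[pos:pos + length])
        out.append(mismatch)
    return "".join(out)


def relative_pointers(phrases):
    """Express each phrase source as an offset from its position in the target."""
    rel = []
    i = 0  # start of the current phrase in the target
    for pos, length, _ in phrases:
        rel.append(pos - i)
        i += length + 1
    return rel


# Toy data: the target differs from the reference by two substitutions.
reference = "ACGTTGCAAGCTTAGC"
target = "ACGTTACAAGCCTAGC"

phrases = rlz_parse(reference, target)
rel = relative_pointers(phrases)
print(phrases)  # [(0, 5, 'A'), (6, 5, 'C'), (12, 3, 'C')]
print(rel)      # [0, 0, 0]
```

Because the two sequences differ only by substitutions, every relative pointer is the same, so the pointer sequence forms a single run that run-length compression stores cheaply. An insertion or deletion would shift all later relative pointers, which is the case the abstract says the generalization is meant to handle.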

References

  1. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23, 337–343 (1977)
  2. Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 201–206. Springer, Heidelberg (2010)
  3. Ziv, J., Merhav, N.: A measure of relative entropy between individual sequences with application to universal classification. IEEE Trans. Inf. Theor. 39, 1270–1279 (1993)
  4. Hoobin, C., Puglisi, S.J., Zobel, J.: Sample selection for dictionary-based corpus compression. In: Proceedings of SIGIR, pp. 1137–1138 (2011)
  5. Hoobin, C., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB 5, 265–273 (2011)
  6. Deorowicz, S., Grabowski, S.: Robust relative compression of genomes with random access. Bioinformatics 27, 2979–2986 (2011)
  7. Ferrada, H., Gagie, T., Gog, S., Puglisi, S.J.: Relative Lempel-Ziv with constant-time random access. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 13–17. Springer, Heidelberg (2014)
  8. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Hybrid compression of bitvectors for the FM-index. In: Proceedings of DCC, pp. 302–311 (2014)
  9. Deorowicz, S., Danek, A., Niemiec, M.: GDC2: compression of large collections of genomes. Sci. Rep. 5, 1–12 (2015)
  10. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52, 552–581 (2005)
  11. Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994)
  12. Belazzougui, D., Gagie, T., Gog, S., Manzini, G., Sirén, J.: Relative FM-indexes. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 52–64. Springer, Heidelberg (2014)
  13. Boucher, C., Bowe, A., Gagie, T., Manzini, G., Sirén, J.: Relative select. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 149–155. Springer, Heidelberg (2015)
  14. Léonard, M., Mouchard, L., Salson, M.: On the number of elements to reorder when updating a suffix array. J. Discrete Algorithms 11, 87–99 (2012)
  15. Gagie, T., Navarro, G., Puglisi, S.J., Sirén, J.: Relative compressed suffix trees. Technical report 1508.02550 (2015). arxiv.org
  16. Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings of ALENEX (2007)
  17. Raman, R., Raman, V., Satti, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3, 43 (2007)
  18. Farruggia, A., Ferragina, P., Venturini, R.: Bicriteria data compression. In: Proceedings of SODA, pp. 1582–1595 (2014)
  19. Farruggia, A., Ferragina, P., Venturini, R.: Bicriteria data compression: efficient and usable. In: Schulz, A.S., Wagner, D. (eds.) ESA 2014. LNCS, vol. 8737, pp. 406–417. Springer, Heidelberg (2014)
  20. Brudno, M., Malde, S., Poliakov, A., Do, C.B., Couronne, O., Dubchak, I., Batzoglou, S.: Glocal alignment: finding rearrangements during alignment. In: Proceedings of ISMB, pp. 54–62 (2003)
  21. Kubincová, P.: Mapping between genomes. Bachelor thesis, Comenius University, Slovakia. Supervised by Broňa Brejová (2014)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Anthony J. Cox (1)
  • Andrea Farruggia (2)
  • Travis Gagie (3, 4)
  • Simon J. Puglisi (3, 4)
  • Jouni Sirén (5)

  1. Illumina Cambridge Ltd., Cambridge, UK
  2. University of Pisa, Pisa, Italy
  3. Helsinki Institute for Information Technology, Espoo, Finland
  4. University of Helsinki, Helsinki, Finland
  5. Wellcome Trust Sanger Institute, Hinxton, UK
