Chapter

String Processing and Information Retrieval

Volume 6393 of the series Lecture Notes in Computer Science pp 201-206

Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval

  • Shanika Kuruppu, affiliated with National ICT Australia, Department of Computer Science & Software Engineering, University of Melbourne
  • Simon J. Puglisi, affiliated with School of Computer Science and Information Technology, Royal Melbourne Institute of Technology
  • Justin Zobel, affiliated with National ICT Australia, Department of Computer Science & Software Engineering, University of Melbourne

Abstract

Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just \(nH_k(T) + s\log n + s\log \frac{N}{s} + O(s)\) bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires \(O(\ell + \log n)\) time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
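
The idea of encoding a sequence relative to a base can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `rlz_factorize` and `rlz_decode` are hypothetical, the longest match is found by a naive substring search rather than through a self-index of the base, and symbols absent from the base are stored as literals with length 0, which is a convention assumed here for illustration.

```python
def rlz_factorize(reference, target):
    """Greedily parse `target` into (position, length) factors that copy
    substrings of `reference`. A symbol that never occurs in `reference`
    is emitted as a literal factor (symbol, 0). Naive search, for
    illustration only; a real system would query an index of the base."""
    factors = []
    i = 0
    while i < len(target):
        pos, length = -1, 0
        # Extend the match one symbol at a time while it still occurs
        # somewhere in the reference.
        while i + length < len(target):
            found = reference.find(target[i:i + length + 1])
            if found == -1:
                break
            pos = found
            length += 1
        if length == 0:
            factors.append((target[i], 0))  # literal symbol
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def rlz_decode(reference, factors):
    """Rebuild the target sequence from its factors and the reference."""
    pieces = []
    for head, length in factors:
        if length == 0:
            pieces.append(head)  # literal symbol
        else:
            pieces.append(reference[head:head + length])
    return "".join(pieces)


if __name__ == "__main__":
    base = "ACGTACGTTGCA"
    variant = "ACGTACCTTGCA"  # one substitution relative to the base
    parsed = rlz_factorize(base, variant)
    print(parsed)
    print(rlz_decode(base, parsed) == variant)
```

In this sketch a sequence identical to the base collapses to a single factor, and each point mutation adds only a small, constant number of factors. That is the intuition behind the space bound growing with the number of mutations s rather than with the total collection size N.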