Abstract
One of the most famous lossless data-compression schemes is the one introduced by Lempel and Ziv in the late 1970s. Many commercial and non-commercial programs are currently based on it, including gzip, zip, pkzip, arj, and rar, to cite just a few.
Notes
1. Recently, Crochemore et al. (2008) showed how to achieve the optimal \(O(n)\) time and space when the alphabet has size \(O(n)\) and the window is unbounded, i.e., \(M=n\).
2. Gzip home page: http://www.gzip.org.
3. Bzip2 home page: http://www.bzip.org/.
4. For a larger alphabet, our algorithms remain correct, but we need to add the term \(T_{sort}(n, \sigma)\) to their time complexities; it denotes the time required to sort/remap all distinct symbols of \(T\) into the range \([n]\).
5. Notice that the node \(u_j\) can be identified during the rightward scan of \(T\), as is usually done in LZ77 parsing, taking \(O(n)\) time for all identified phrases.
6. Recall that the variant of LZ77 considered in this chapter uses just a pair of integers per phrase, and thus drops the character following that phrase in \(T\).
7. Notice that there may be several different candidate positions \(p\) from which we can copy the substring \(T[i:j-1]\). We can arbitrarily choose any position among those whose distance from \(i\) is encodable with the smallest number of bits (namely, those for which \(|f(d_{i,j})|\) is minimized).
8. Recall that \(c(v_i,v_j) = |f(d_{i,j})| + |g(\ell_{i,j})|\) if the edge exists; otherwise we set \(c(v_i,v_j) = +\infty\).
9. Observe that \(|FS(v)| \le Q(f,n) + Q(g,n)\) for any vertex \(v\) in \(\widetilde{\mathcal{G}}(T)\) (Lemma 4.2).
10. Observe that there may be several leaves with these characteristics. We can arbitrarily choose one of them because they denote copies of the same phrase that can be encoded with the same number of bits for the length and for the distance (i.e., \(c(I_k)\) bits).
11. The value of \(\mathtt{mp}[a(u)]\) can be arbitrarily set to \(\mathtt{min}(u)\) or \(\mathtt{max}(u)\) whenever both \(\mathtt{min}(u)\) and \(\mathtt{max}(u)\) belong to \(W_{a(u)}\).
12. Observe that we obtain \(\mathtt{mp}[11] = 10\) by setting \(\mathtt{mp}[a(u)] = a(\mathtt{parent}(u))\) at the node with \(a = 11\).
13. Algorithms Rightmost-LZ77 and BitOptimal-LZ77 encode copy-distances and lengths by using a variant of Rice codes in which there is not a single bucketing of size \(2^k\) but rather a series of buckets of increasing size, fixed in advance.
14. Lzma2 home page: http://7-zip.org/.
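To make Notes 7, 8 and 13 concrete, the following minimal Python sketch computes codeword lengths for a bucketed integer code and uses them as the edge cost \(|f(d)| + |g(\ell)|\). It is an illustration under stated assumptions, not the encoder used in the chapter: the bucket widths `(2, 4, 8)`, the unary bucket selector, and the function names are all hypothetical choices made for this example.

```python
from typing import Iterable, Sequence


def bucket_code_length(x: int, widths: Sequence[int]) -> int:
    """Bits needed to encode integer x >= 1 with a bucketed code.

    Assumption (not from the chapter): bucket i holds 2**widths[i]
    consecutive values and is selected by a unary prefix of i + 1 bits,
    followed by widths[i] bits giving the offset inside the bucket.
    """
    if x < 1:
        raise ValueError("x must be a positive integer")
    base = 1  # smallest value stored in the current bucket
    for i, w in enumerate(widths):
        size = 1 << w
        if x < base + size:
            return (i + 1) + w
        base += size
    raise ValueError("x exceeds the range covered by the buckets")


def edge_cost(d: int, length: int,
              f_widths: Sequence[int], g_widths: Sequence[int]) -> int:
    """Cost of a phrase copied from distance d with the given length,
    i.e. |f(d)| + |g(length)| as in Note 8."""
    return bucket_code_length(d, f_widths) + bucket_code_length(length, g_widths)


def cheapest_source(i: int, candidates: Iterable[int],
                    f_widths: Sequence[int]) -> int:
    """Among candidate copy positions p < i, return one whose distance
    i - p has the shortest codeword (Note 7); ties are broken arbitrarily."""
    return min(candidates, key=lambda p: bucket_code_length(i - p, f_widths))


if __name__ == "__main__":
    widths = (2, 4, 8)  # hypothetical bucket widths
    print(bucket_code_length(5, widths))
    print(edge_cost(5, 3, widths, widths))
    print(cheapest_source(100, [10, 90, 97], widths))
```

With these widths, all distances that fall in the same bucket cost the same number of bits, which is why any candidate position within the cheapest bucket can be chosen arbitrarily.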
© 2014 Atlantis Press and the authors
About this chapter
Cite this chapter
Venturini, R. (2014). Bit-Complexity of Lempel-Ziv Compression. In: Compressed Data Structures for Strings. Atlantis Studies in Computing, vol 4. Atlantis Press, Paris. https://doi.org/10.2991/978-94-6239-033-1_4
DOI: https://doi.org/10.2991/978-94-6239-033-1_4
Publisher Name: Atlantis Press, Paris
Print ISBN: 978-94-6239-032-4
Online ISBN: 978-94-6239-033-1
eBook Packages: Mathematics and Statistics (R0)