CPM 2014: Combinatorial Pattern Matching pp 182-191

# A really Simple Approximation of Smallest Grammar

• Artur Jeż
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8486)

## Abstract

We present a really simple linear-time algorithm that constructs a context-free grammar of size $$\mathcal{O}(g \log (N/g))$$ for the input string, where N is the size of the input string and g is the size of the optimal grammar generating this string. The algorithm works for alphabets of arbitrary size, but the running time is linear when the alphabet Σ of the input string can be identified with numbers from {1, …, N}. Algorithms with such an approximation guarantee and running time are known; however, all of them are non-trivial and their analyses are involved. The algorithm presented here computes the LZ77 factorisation (of size l) and transforms it in phases into a grammar. In each phase it maintains an LZ77-like factorisation of the word with at most l factors, as well as $$\mathcal{O}(l)$$ additional letters. In one phase we greedily (by a left-to-right sweep) choose a set of pairs of consecutive letters to be replaced with new symbols, i.e. nonterminals of the constructed grammar. We choose at least 2/3 of the letters in the word, and there are $$\mathcal{O}(l)$$ different pairs among them. Hence there are $$\mathcal{O}(\log N)$$ phases, each introducing $$\mathcal{O}(l)$$ nonterminals. A more precise analysis yields a bound of $$\mathcal{O}(l \log(N/l))$$. As l ≤ g, this yields $$\mathcal{O}(g \log(N/g))$$.
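
The phase count can be read off from a short calculation. This is only a sketch, under a simplifying assumption that the abstract does not spell out: that at least 2/3 of the letters of the current word belong to replaced pairs, and that the $$\mathcal{O}(l)$$ additional letters can be ignored. Writing $$N_i$$ for the length of the word after phase i (notation introduced here for illustration), each replaced pair shortens the word by one letter, so

$$N_{i+1} \le N_i - \tfrac{1}{2} \cdot \tfrac{2}{3} N_i = \tfrac{2}{3} N_i, \qquad N_0 = N.$$

Hence $$N_k \le (2/3)^k N$$, which drops to 1 once $$k \ge \log_{3/2} N$$, giving $$\mathcal{O}(\log N)$$ phases. With $$\mathcal{O}(l)$$ nonterminals per phase, this simple count gives $$\mathcal{O}(l \log N)$$ nonterminals in total. The sharper $$\mathcal{O}(l \log(N/l))$$ bound requires the more precise analysis.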

## Keywords

Grammar-based compression · Construction of the smallest grammar · SLP compression
