Abstract
The ability to induce short descriptions of, i.e. to compress, a wide class of data is essential for any system exhibiting general intelligence. It is proven, in full generality, that incremental compression – extracting features of data strings and continuing to compress the residual data variance – leads to a time complexity superior to that of universal search whenever the strings are incrementally compressible. It is further shown that such a procedure breaks up the shortest description into a set of features that are pairwise orthogonal in terms of algorithmic information.
Keywords
 Incremental compression
 Data compression
 Algorithmic complexity
 Universal induction
 Universal search
 Feature extraction
A. Franz—Independent researcher
Notes
 1.
Note that the \(\left\langle \cdot ,\cdot \right\rangle \) map is defined with \(\left\langle z,\epsilon \right\rangle \equiv z\), hence \(f_{k}(\epsilon )=U\left( \left\langle f_{k},\epsilon \right\rangle \right) =U(f_{k})\), so that \(f_{k}\) acts as a usual string in the universal machine.
 2.
It is not difficult to see that the “\(\ll \)” sign is justified in all but very few cases. After all, only for very few combinations of integers \(l_{i}\) with fixed sum \(\sum _{i}l_{i}=L\) is the sum \(\sum _{i}2^{l_{i}}\) close to \(2^{L}\).
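The counting claim in this note can be checked numerically. The following sketch (an illustrative check, not part of the paper; the threshold \(2^{L-1}\) is an assumed reading of "close to \(2^{L}\)") enumerates all compositions of a small L and counts how many come close to the maximum:

```python
# Numeric check of the footnote's claim (illustrative only): for integers
# l_1..l_n with fixed sum L, the sum of 2**l_i is almost never close to 2**L.
import itertools

L, n = 20, 4
close = total = 0
# enumerate all compositions of L into n positive parts
for parts in itertools.product(range(1, L), repeat=n):
    if sum(parts) != L:
        continue
    total += 1
    if sum(2 ** l for l in parts) > 2 ** (L - 1):  # "close" to 2**L (assumed threshold)
        close += 1

print(close, total)  # prints: 0 969
```

With four positive parts summing to 20, the largest part is at most 17, so \(\sum _{i}2^{l_{i}}\le 2^{17}+3\cdot 2\ll 2^{19}\); none of the 969 compositions comes close.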
References
Hutter, M.: On universal prediction and Bayesian confirmation. Theor. Comput. Sci. 384(1), 33–48 (2007)
Levin, L.A.: Universal sequential search problems. Problemy Peredachi Informatsii 9(3), 115–116 (1973)
Hutter, M.: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability, 300p. Springer, Heidelberg (2005). http://www.hutter1.net/ai/uaibook.htm
Schmidhuber, J.: Optimal ordered problem solver. Mach. Learn. 54(3), 211–254 (2004)
Potapov, A., Rodionov, S.: Making universal induction efficient by specialization. In: Goertzel, B., Orseau, L., Snaider, J. (eds.) AGI 2014. LNCS, vol. 8598, pp. 133–142. Springer, Heidelberg (2014)
Franz, A.: Artificial general intelligence through recursive data compression and grounded reasoning: a position paper. CoRR, abs/1506.04366 (2015). http://arXiv.org/abs/1506.04366
Franz, A.: Toward tractable universal induction through recursive program learning. In: Bieger, J., Goertzel, B., Potapov, A. (eds.) AGI 2015. LNCS, vol. 9205, pp. 251–260. Springer, Heidelberg (2015)
Li, M., Vitányi, P.M.: An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science. Springer, New York (2009)
Acknowledgements
I would like to express my gratitude to Alexey Potapov and Alexander Priamikov for proof reading and helpful comments.
A Proofs
Proof
(Lemma 1).

1.
Suppose there is a shorter program g with \(l(g)<l(f^{*})\) that generates x with the help of p: \(U\left( \left\langle g,p\right\rangle \right) =x\). Then there is also a descriptive map \(g'\equiv f'^{*}\) that computes p from x, and \(l(g'(x))=l(f'^{*}(x))<l(x)-l(f^{*})<l(x)-l(g)\). Therefore, g is a feature of x by definition, which conflicts with \(f^{*}\) already being the shortest feature.

2.
Suppose there is a shorter program \(g'\) with \(l(g')<l(f'^{*})\) that generates p with the help of x: \(U\left( \left\langle g',x\right\rangle \right) =g'(x)=p\). Then \(g'\in D_{f^{*}}(x)\), since \(f^{*}(g'(x))=f^{*}(p)=x\) and \(l(g'(x))=l(p)<l(x)-l(f^{*})\) by construction of \(f'^{*}\). However, by Eq. (4.3) \(f'^{*}\) is already the shortest program able to do so, contradicting the assumption. \(\square \)
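The feature condition used throughout the proof – \(f(f'(x))=x\) with \(l(f'(x))<l(x)-l(f)\) – can be made concrete with a toy stand-in (the function below and the assumed encoded length are illustrative only, not the paper's universal machine U):

```python
# Toy stand-in for a feature f with parameter p = f'(x); lengths are counted
# in symbols and l_f is an assumed encoded length of f, purely for illustration.

x = "ab" * 50                       # a regular string, l(x) = 100

def f(p):
    """Hypothetical feature: repeat the parameter up to length l(x)."""
    return (p * (len(x) // len(p)))[:len(x)]

p = "ab"                            # residual parameter, l(p) = 2
l_f = 30                            # assumed length of an encoding of f

assert f(p) == x                    # f reconstructs x from p
assert len(p) < len(x) - l_f        # feature condition: l(f'(x)) < l(x) - l(f)
```

Here the pair (f, p) describes x in far fewer than l(x) symbols, so f qualifies as a feature of x in the sense of the definition.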
Proof
(Theorem 1). From Lemma 1 we know \(l(f^{*})=K(x|p)\), with \(p=f'^{*}(x)\). In full generality, for the shortest program q computing x, \(l(q)=K(x)=K(q)+O(1)\) holds, since q is incompressible (it would not be the shortest program otherwise). For shortest features, the conditional case also holds: \(K(x|p)=K(f^{*}|p)+O(1)\). After all, if there were a shorter program g, \(l(g)<l(f^{*})\), that computed \(f^{*}\) with the help of p, it could also go on to compute x from \(f^{*}\) and p, leading to \(K(x|p)\le l(g)+O(1)<l(f^{*})+O(1)\), which contradicts \(l(f^{*})=K(x|p)\).
Further, for any two strings, \(K(f^{*}|p)\le K(f^{*})\), since p can only help in compressing \(f^{*}\). Putting it all together leads to \(l(f^{*})=K(x|p)=K(f^{*}|p)+O(1)\le K(f^{*})+O(1)\). On the other hand, since in general \(K(f^{*})\le l(f^{*})+O(1)\) is also true, the claim \(K(f^{*})=l(f^{*})+O(1)\) follows. \(\square \)
Proof
(Theorem 2).

1.
Follows immediately from \(K(f^{*})=l(f^{*})+O(1)=K(x|p)+O(1)=K(f^{*}|p)+O(1)\).

2.
The first equality follows from Theorem 1, since we only need to read off the length of \(f^{*}\) in order to know \(K(f^{*})\) up to a constant. For the second equality, consider the symmetry of the conditional prefix complexity relation \(K(f^{*},p)=K(f^{*})+K\left( p|f^{*},K(f^{*})\right) +O(1)=K(p)+K\left( f^{*}|p,K(p)\right) +O(1)\) [8, Theorem 3.9.1, p. 247]. If p does not help computing a shorter \(f^{*}\), then knowing K(p) will not help either. Therefore, from (1), we obtain \(K\left( f^{*}|p,K(p)\right) =K(f^{*})+O(1)\) and therefore \(K\left( p|f^{*},K(f^{*})\right) =K(p)+O(1)\).

3.
In general, by [8, Theorem 3.9.1, p. 247] we can expand \(K(f^{*},p)=K(f^{*})+K\left( p|f^{*},K(f^{*})\right) +O(1)\). After inserting (2), the claim follows. \(\square \)
Proof
(Theorem 3).

1.
Expand K(x, p) up to an additive constant:
$$\begin{aligned} K(p)+K(x|p,K(p))=K(x,p)=K(x)+K(p|x,K(x)) \end{aligned}$$(A.1)From Lemma 1(1) and Theorem 1 we know \(K(f^{*})=K(x|p)+O(1)\). Conditioning this on K(p) and using \(f^{*}\)’s independence of p, and thereby of K(p) (Theorem 2(1)), we get \(K(x|p,K(p))=K(f^{*}|K(p))+O(1)=K(f^{*})+O(1)\). Inserting this into Eq. (A.1) and using Theorem 2(3) yields
$$\begin{aligned} K(f^{*},p)=K(p)+K(f^{*})=K(x)+K(p|x,K(x))+O(1) \end{aligned}$$(A.2)
2.
Fix \(f^{*}\) and let \(P_{f^{*}}(x)\equiv \left\{ f'(x):\; f'\in D_{f^{*}}(x)\right\} \) be the set of admissible parameters computing x from \(f^{*}\). From Lemma 1(2), we know that minimizing \(l(f')\), with \(s=f'(x)\), is equivalent to minimizing \(K(s|x)\), i.e. choosing a string \(p=f'^{*}(x)\in P_{f^{*}}(x)\) such that \(K(s|x)\ge K(p|x)\) for all \(s\in P_{f^{*}}(x)\). Conditioning Eq. (A.2) on x leads to:
$$\begin{aligned} K(p|x)+K(f^{*}|x)=K(x|x)+K(p|x,K(x),x)=K(p|x,K(x)) \end{aligned}$$(A.3)up to additive constants. Since \(f^{*}\) and x are fixed, the claim \(l(f'^{*})=K(p|x)\propto K(p|x,K(x))+O(1)\) follows.

3.
It remains to show that there exists some \(p\in P_{f^{*}}(x)\) such that \(K(p|x,K(x))=O(1)\). After all, if it does exist, it will be identified by minimizing \(l(f')\), as implied by (2). Define \(q\equiv \text{ argmin }_{s}\left\{ l(s):\; U\left( \left\langle f^{*},U(s)\right\rangle \right) =f^{*}\left( U(s)\right) =x\right\} \) and compute \(p\equiv U(q)\). Since \(f^{*}(p)=x\), \(p\in P_{f^{*}}(x)\). Further, there is no shorter program able to compute p, since with p we can compute x given \(f^{*}\), and q is already the shortest one able to do so, by definition. Therefore, \(l(q)=K(p)+O(1)\) and \(K(x|f^{*})\le K(p)+O(1)\). Can the complexity \(K(x|f^{*})\) be strictly smaller than K(p), thereby surpassing the presumably residual part in p? Let \(p'\) be such a program: \(l(p')=K(x|f^{*})<K(p)+O(1)\). By definition of \(K(x|f^{*})\), \(f^{*}(p')=x\). However, then we can find the shortest program \(q'\) that computes \(p'\), and we get \(f^{*}\left( U(q')\right) =x\). Since \(l(q')\le l(p')+O(1)\), we get \(l(q')<K(p)+O(1)=l(q)+O(1)\). However, this contradicts the fact that q is already the shortest program able to compute \(f^{*}(U(q))=x\). Therefore,
$$\begin{aligned} l(q)=K(x|f^{*})=K(p)+O(1) \end{aligned}$$(A.4)In order to prove \(K(p|x,K(x))=O(1)\), consider the following general expansion
$$\begin{aligned} K(p,x|f^{*})=K(x|f^{*})+K(p|x,K(x),f^{*})+O(1) \end{aligned}$$(A.5)Since we can compute p from q and go on to compute x given \(f^{*}\), \(l(q)=K(p,x|f^{*})+O(1)\). After all, note that with Theorem 2(2), we have \(l(q)=K(p)=K(p|f^{*})\le K(p,x|f^{*})\) up to additive constants, but since we can compute \(\left\langle p,x\right\rangle \) given \(f^{*}\) from q, we know \(K(p,x|f^{*})\le l(q)+O(1)\). Both inequalities can only be true if the equality \(l(q)=K(p,x|f^{*})+O(1)\) holds. At the same time, from Eq. (A.4), \(l(q)=K(x|f^{*})\) holds. Inserting this into Eq. (A.5) leads to \(K(p|x,K(x),f^{*})=O(1)\). Taking \(K(p)=K(p|f^{*})+O(1)\) (Theorem 2(2)) and inserting the conditionals x and K(x) leads to \(K(p|x,K(x))=K(p|x,K(x),f^{*})+O(1)=O(1)\). Since this shows that a \(p\in P_{f^{*}}(x)\) exists with the minimal value \(K(p|x,K(x))=O(1)\), (2) implies that it must be the same as, or equivalent to, the one found by minimizing \(l(f')\).

4.
Conditioning Eq. (A.3) on K(x) we get \(K(p|x,K(x))+K(f^{*}|x,K(x))=K(p|x,K(x))+O(1)\), from which the claim follows. \(\square \)
Proof
(Corollary 1). Inserting Eq. (5.2) into Eq. (5.1) proves the point. \(\square \)
Proof
(Corollary 2). Inserting Eq. (A.2) into Eq. (5.4) and using the incompressibility of \(f^{*}\) (Theorem 1) proves the point. \(\square \)
Proof
(Theorem 4). According to the definition of a feature, at each compression step the length of the parameters obeys \(l(p_{i})<l(x)-l(f_{i}^{*})\) and their complexity (Corollary 2) decreases. Since the \(f_{i}^{*}\) are incompressible themselves (Theorem 1), the parameters store the residual information about x. Therefore, at some point, only the possibility \(p_k\equiv {f'}^{*}_k(p_{k-1})=\epsilon \) with \(l(f_{k}^{*})=K(p_{k-1})\) remains and the compression has to stop. Expanding Corollary 2 proves the result: \(K(x)=l(f_{1}^{*})+K(p_{1})+O(1)=l(f_{1}^{*})+l(f_{2}^{*})+K(p_{2})+O(1)=\sum _{i=1}^{k}l(f_{i}^{*})+O(1)\). \(\square \)
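The telescoping argument can be illustrated with a toy run (all numbers are assumed lengths for illustration, not actual Kolmogorov complexities): each step peels off one feature length from the residual complexity until the empty parameter \(\epsilon \) is reached.

```python
# Toy illustration of Theorem 4's telescoping sum, with assumed values:
# at each step the residual complexity shrinks by the length of the
# extracted feature, until the empty parameter epsilon remains.

K_x = 12                      # pretend K(x) = 12 bits (assumed)
feature_lengths = [5, 4, 3]   # l(f_1*), l(f_2*), l(f_3*) -- assumed

residual = K_x                # K(p_0) with p_0 = x
for l_f in feature_lengths:
    # Corollary 2: K(p_{i-1}) = l(f_i*) + K(p_i) + O(1)
    residual -= l_f
    print("residual complexity after step:", residual)

# compression stops once the residual is exhausted: p_k = epsilon
assert residual == 0
assert sum(feature_lengths) == K_x   # K(x) = sum_i l(f_i*) + O(1)
```

The final assertion is exactly the statement of the theorem with the \(O(1)\) terms dropped.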
Proof
(Theorem 5). Algorithmic information is defined as \(I(f_{i}^{*}:f_{j}^{*})\equiv K(f_{j}^{*})-K(f_{j}^{*}|f_{i}^{*})\). The case \(i=j\) is trivial, since \(K(f_{i}^{*}|f_{i}^{*})=O(1)\). If \(i>j\), then \(p_{j}=\left( f_{j+1}^{*}\circ \cdots \circ f_{i}^{*}\right) (p_{i})\), which implies that all information about \(f_{i}\) is in \(p_{j}\). But since, according to Theorem 2(1), \(K(f_{j}^{*}|p_{j})=K(f_{j}^{*})+O(1)\), we conclude that \(K(f_{j}^{*}|f_{i}^{*})=K(f_{j}^{*})+O(1)\). If \(i<j\), then we know that \(f_{j}^{*}\) in no way contributed to the construction of \(p_{i}\) further in the compression process. Hence \(K(f_{j}^{*}|f_{i}^{*})=K(f_{j}^{*})+O(1)\). \(\square \)
Proof
(Theorem 6). Let \(p\equiv f'^{*}(x)\). Further, from Lemma 1 we know that \(K(x|p)=l(f^{*})\) and \(K(p|x)=l(f'^{*})\). Using Corollary 2, the difference in algorithmic information is \(I(p:x)-I(x:p)=K(x)-K(x|p)-K(p)+K(p|x)=l(f'^{*})+O(1)\). By [8, Lemma 3.9.2, p. 250], algorithmic information is symmetric up to logarithmic terms: \(I(x:p)-I(p:x)\le \log K(x)+2\log \log K(x)+\log K(p)+2\log \log K(p)+O(1)\). Since x is computed from \(f^{*}\) and p, we have \(K(p)\le K(x)\). Putting everything together leads to \(l(f'^{*})\le 2\log K(x)+4\log \log K(x)+O(1)\). The second inequality follows from \(K(x)\le l(x)+O(1)\). \(\square \)
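To get a feel for the bound, one can evaluate \(2\log K(x)+4\log \log K(x)\) numerically (plain arithmetic with binary logarithms, the \(O(1)\) term ignored; since \(K(x)\le l(x)+O(1)\), the string length stands in for K(x)):

```python
# Evaluate Theorem 6's bound 2*log K(x) + 4*log log K(x) for sample string
# lengths, substituting l(x) for K(x) via K(x) <= l(x) + O(1).
from math import log2

def theorem6_bound(k):
    """Logarithmic bound on l(f'*), O(1) term omitted."""
    return 2 * log2(k) + 4 * log2(log2(k))

for l_x in (10**3, 10**6, 10**9):
    print(l_x, round(theorem6_bound(l_x), 1))
```

Even for a string of a billion symbols the bound stays below 80 bits, which is what makes the shortness of descriptive maps practically significant.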
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Franz, A. (2016). Some Theorems on Incremental Compression. In: Steunebrink, B., Wang, P., Goertzel, B. (eds.) Artificial General Intelligence. AGI 2016. Lecture Notes in Computer Science, vol. 9782. Springer, Cham. https://doi.org/10.1007/978-3-319-41649-6_8
Print ISBN: 978-3-319-41648-9
Online ISBN: 978-3-319-41649-6