Abstract
The ability to induce short descriptions of, i.e. to compress, a wide class of data is essential for any system exhibiting general intelligence. It is proven in full generality that incremental compression, i.e. extracting features of data strings and continuing to compress the residual data variance, leads to a time complexity superior to that of universal search if the strings are incrementally compressible. It is further shown that such a procedure breaks up the shortest description into a set of features that are pairwise orthogonal in terms of algorithmic information.
Keywords
- Incremental compression
- Data compression
- Algorithmic complexity
- Universal induction
- Universal search
- Feature extraction
A. Franz—Independent researcher
Notes
1. Note that the \(\left\langle \cdot ,\cdot \right\rangle \)-map is defined with \(\left\langle z,\epsilon \right\rangle \equiv z\), hence \(f_{k}(\epsilon )=U\left( \left\langle f_{k},\epsilon \right\rangle \right) =U(f_{k})\), so that \(f_{k}\) acts as a usual string in the universal machine.
2. It is not difficult to see that the “\(\ll \)” sign is justified in all but very few cases. After all, for only very few combinations of integers \(l_{i}\) with a fixed sum \(\sum _{i}l_{i}=L\) is the sum \(\sum _{i}2^{l_{i}}\) close to \(2^{L}\).
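To get a feeling for how rarely \(\sum _{i}2^{l_{i}}\) comes close to \(2^{L}\), the following Python sketch enumerates every way of writing L as an ordered sum of k positive integers and computes the ratio of the two quantities. The values L = 20 and k = 4, the threshold 0.1, and the helper names `compositions` and `cost_ratio` are illustrative assumptions and do not come from the paper. Restricting the \(l_{i}\) to positive integers is likewise an assumption.

```python
from itertools import combinations


def compositions(total, parts):
    """Yield all ordered ways to write `total` as a sum of `parts` positive integers."""
    # Choose parts-1 cut points strictly between 0 and total.
    for cuts in combinations(range(1, total), parts - 1):
        bounds = (0,) + cuts + (total,)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


def cost_ratio(lengths):
    """Return (sum_i 2**l_i) / 2**L, where L = sum(lengths)."""
    return sum(2 ** l for l in lengths) / 2 ** sum(lengths)


L, k = 20, 4  # illustrative values, not taken from the paper
ratios = [cost_ratio(c) for c in compositions(L, k)]
near = sum(r > 0.1 for r in ratios)  # compositions within a factor 10 of 2**L

print(f"compositions: {len(ratios)}, ratio > 0.1: {near}, max ratio: {max(ratios):.4f}")
```

For these values, only the four orderings of (1, 1, 1, 17) out of 969 compositions exceed a ratio of 0.1, and even they reach only about 0.125.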
References
1. Hutter, M.: On universal prediction and Bayesian confirmation. Theor. Comput. Sci. 384(1), 33–48 (2007)
2. Levin, L.A.: Universal sequential search problems. Problemy Peredachi Informatsii 9(3), 115–116 (1973)
3. Hutter, M.: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability, 300p. Springer, Heidelberg (2005). http://www.hutter1.net/ai/uaibook.htm
4. Schmidhuber, J.: Optimal ordered problem solver. Mach. Learn. 54(3), 211–254 (2004)
5. Potapov, A., Rodionov, S.: Making universal induction efficient by specialization. In: Goertzel, B., Orseau, L., Snaider, J. (eds.) AGI 2014. LNCS, vol. 8598, pp. 133–142. Springer, Heidelberg (2014)
6. Franz, A.: Artificial general intelligence through recursive data compression and grounded reasoning: a position paper. CoRR, abs/1506.04366 (2015). http://arXiv.org/abs/1506.04366
7. Franz, A.: Toward tractable universal induction through recursive program learning. In: Bieger, J., Goertzel, B., Potapov, A. (eds.) AGI 2015. LNCS, vol. 9205, pp. 251–260. Springer, Heidelberg (2015)
8. Li, M., Vitányi, P.M.: An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science. Springer, New York (2009)
Acknowledgements
I would like to express my gratitude to Alexey Potapov and Alexander Priamikov for proofreading and helpful comments.
A Proofs
Proof (Lemma 1).
1. Suppose there is a shorter program g with \(l(g)<l(f^{*})\) that generates x with the help of p: \(U\left( \left\langle g,p\right\rangle \right) =x\). Then there is also a descriptive map \(g'\equiv f'^{*}\) that computes p from x, and \(l(g'(x))=l(f'^{*}(x))<l(x)-l(f^{*})<l(x)-l(g)\). Therefore, g is a feature of x by definition, which conflicts with \(f^{*}\) already being the shortest feature.
2. Suppose there is a shorter program \(g'\) with \(l(g')<l(f'^{*})\) that generates p with the help of x: \(U\left( \left\langle g',x\right\rangle \right) =g'(x)=p\). Then \(g'\in D_{f^{*}}(x)\) since \(f^{*}(g'(x))=f^{*}(p)=x\) and \(l(g'(x))=l(p)<l(x)-l(f^{*})\) by construction of \(f'^{*}\). However, by Eq. (4.3), \(f'^{*}\) is already the shortest program able to do so, contradicting the assumption. \(\square \)
Proof (Theorem 1). From Lemma 1 we know \(l(f^{*})=K(x|p)\), with \(p=f'^{*}(x)\). In general, for the shortest program q computing x, \(l(q)=K(x)=K(q)+O(1)\) holds, since q is incompressible (it would not be the shortest program otherwise). For shortest features, the conditional case is also true: \(K(x|p)=K(f^{*}|p)+O(1)\). After all, if there were a shorter program g, \(l(g)<l(f^{*})\), that computed \(f^{*}\) with the help of p, it could also go on to compute x from \(f^{*}\) and p, leading to \(K(x|p)\le l(g)+O(1)<l(f^{*})+O(1)\), which contradicts \(l(f^{*})=K(x|p)\).
Further, for any two strings, \(K(f^{*}|p)\le K(f^{*})+O(1)\), since p can only help in compressing \(f^{*}\). Putting it all together leads to \(l(f^{*})=K(x|p)=K(f^{*}|p)+O(1)\le K(f^{*})+O(1)\). On the other hand, since in general \(K(f^{*})\le l(f^{*})+O(1)\) is also true, the claim \(K(f^{*})=l(f^{*})+O(1)\) follows. \(\square \)
Proof (Theorem 2).
1. Follows immediately from \(K(f^{*})=l(f^{*})+O(1)=K(x|p)+O(1)=K(f^{*}|p)+O(1)\).
2. The first equality follows from Theorem 1, since we only need to read off the length of \(f^{*}\) in order to know \(K(f^{*})\) up to a constant. For the second equality, consider the symmetry of the conditional prefix complexity relation \(K(f^{*},p)=K(f^{*})+K\left( p|f^{*},K(f^{*})\right) +O(1)=K(p)+K\left( f^{*}|p,K(p)\right) +O(1)\) [8, Theorem 3.9.1, p. 247]. If p does not help in computing a shorter \(f^{*}\), then knowing K(p) will not help either. Therefore, from (1), we obtain \(K\left( f^{*}|p,K(p)\right) =K(f^{*})+O(1)\) and hence \(K\left( p|f^{*},K(f^{*})\right) =K(p)+O(1)\).
3. In general, by [8, Theorem 3.9.1, p. 247] we can expand \(K(f^{*},p)=K(f^{*})+K\left( p|f^{*},K(f^{*})\right) +O(1)\). After inserting (2) the claim follows. \(\square \)
Proof (Theorem 3).
1. Expand K(x, p) up to an additive constant:
$$\begin{aligned} K(p)+K(x|p,K(p))=K(x,p)=K(x)+K(p|x,K(x)) \end{aligned}$$ (A.1)
From Lemma 1(1) and Theorem 1 we know \(K(f^{*})=K(x|p)+O(1)\). Conditioning this on K(p) and using \(f^{*}\)’s independence of p and thereby of K(p) (Theorem 2(1)) we get \(K(x|p,K(p))=K(f^{*}|K(p))+O(1)=K(f^{*})+O(1)\). Inserting this into Eq. (A.1) and using Theorem 2(3) yields
$$\begin{aligned} K(f^{*},p)=K(p)+K(f^{*})=K(x)+K(p|x,K(x))+O(1) \end{aligned}$$ (A.2)
2. Fix \(f^{*}\) and let \(P_{f^{*}}(x)\equiv \left\{ f'(x):\; f'\in D_{f^{*}}(x)\right\} \) be the set of admissible parameters computing x from \(f^{*}\). From Lemma 1(2), we know that minimizing \(l(f')\), with \(s=f'(x)\), is equivalent to minimizing K(s|x), i.e. choosing a string \(p=f'^{*}(x)\in P_{f^{*}}(x)\) such that \(K(s|x)\ge K(p|x)\) for all \(s\in P_{f^{*}}(x)\). Conditioning Eq. (A.2) on x leads to:
$$\begin{aligned} K(p|x)+K(f^{*}|x)=K(x|x)+K(p|x,K(x),x)=K(p|x,K(x)) \end{aligned}$$ (A.3)
up to additive constants. Since \(f^{*}\) and x are fixed, the claim \(l(f'^{*})=K(p|x)\propto K(p|x,K(x))+O(1)\) follows.
3. It remains to show that there exists some \(p\in P_{f^{*}}(x)\) such that \(K(p|x,K(x))=O(1)\). After all, if it does exist, it will be identified by minimizing \(l(f')\), as implied by (2). Define \(q\equiv \operatorname{argmin}_{s}\left\{ l(s):\; U\left( \left\langle f^{*},U(s)\right\rangle \right) =f^{*}\left( U(s)\right) =x\right\} \) and compute \(p\equiv U(q)\). Since \(f^{*}(p)=x\), \(p\in P_{f^{*}}(x)\). Further, there is no shorter program able to compute p, since with p we can compute x given \(f^{*}\) and q is already the shortest one able to do so, by definition. Therefore, \(l(q)=K(p)+O(1)\) and \(K(x|f^{*})\le K(p)+O(1)\). Can the complexity \(K(x|f^{*})\) be strictly smaller than K(p), thereby surpassing the presumably residual part in p? Let \(p'\) be such a program: \(l(p')=K(x|f^{*})<K(p)+O(1)\). By definition of \(K(x|f^{*})\), \(f^{*}(p')=x\). However, then we can find the shortest program \(q'\) that computes \(p'\) and we get: \(f^{*}\left( U(q')\right) =x\). Since \(l(q')\le l(p')+O(1)\), we get \(l(q')<K(p)+O(1)=l(q)+O(1)\). However, this contradicts the fact that q is already the shortest program able to compute \(f^{*}(U(q))=x\). Therefore,
$$\begin{aligned} l(q)=K(x|f^{*})=K(p)+O(1) \end{aligned}$$ (A.4)
In order to prove \(K(p|x,K(x))=O(1)\), consider the following general expansion:
$$\begin{aligned} K(p,x|f^{*})=K(x|f^{*})+K(p|x,K(x),f^{*})+O(1) \end{aligned}$$ (A.5)
Since we can compute p from q and go on to compute x given \(f^{*}\), \(l(q)=K(p,x|f^{*})+O(1)\). After all, note that with Theorem 2(2), we have \(l(q)=K(p)=K(p|f^{*})\le K(p,x|f^{*})\) up to additive constants, but since we can compute \(\left\langle p,x\right\rangle \) given \(f^{*}\) from q, we know \(K(p,x|f^{*})\le l(q)+O(1)\). Both inequalities can only be true if the equality \(l(q)=K(p,x|f^{*})+O(1)\) holds. At the same time, from Eq. (A.4), \(l(q)=K(x|f^{*})\) holds. Inserting this into Eq. (A.5) leads to \(K(p|x,K(x),f^{*})=O(1)\). Taking \(K(p)=K(p|f^{*})+O(1)\) (Theorem 2(2)), and inserting the conditionals x and K(x) leads to: \(K(p|x,K(x))=K(p|x,K(x),f^{*})+O(1)=O(1)\). Since this shows that a \(p\in P_{f^{*}}(x)\) exists with the minimal value \(K(p|x,K(x))=O(1)\), (2) implies that it must be the same as or equivalent to the one found by minimizing \(l(f')\).
4. Conditioning Eq. (A.3) on K(x) we get \(K(p|x,K(x))+K(f^{*}|x,K(x))=K(p|x,K(x))+O(1)\), from which the claim follows. \(\square \)
Proof (Corollary 1). Inserting Eq. (5.2) into Eq. (5.1) proves the point. \(\square \)
Proof (Corollary 2). Inserting Eq. (A.2) into Eq. (5.4) and using the incompressibility of \(f^{*}\) (Theorem 1) proves the point. \(\square \)
Proof (Theorem 4). According to the definition of a feature, at each compression step the length of the parameters, \(l(p_{i})<l(x)-l(f_{i}^{*})\), and their complexity (Corollary 2) decrease. Since the \(f_{i}^{*}\) are incompressible themselves (Theorem 1), the parameters store the residual information about x. Therefore, at some point, only the possibility \(p_k\equiv {f'}^{*}_k(p_{k-1})=\epsilon \) with \(l(f_{k}^{*})=K(p_{k-1})\) remains and the compression has to stop. Expanding Corollary 2 proves the result: \(K(x)=l(f_{1}^{*})+K(p_{1})+O(1)=l(f_{1}^{*})+l(f_{2}^{*})+K(p_{2})+O(1)=\cdots =\sum _{i=1}^{k}l(f_{i}^{*})+O(1)\). \(\square \)
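The expansion at the end of this proof can be written out step by step. The display below is a sketch under assumptions that the text does not state in this form. The shorthand \(p_{0}\equiv x\) is introduced only for illustration. Corollary 2 is taken to read \(K(p_{i-1})=l(f_{i}^{*})+K(p_{i})+O(1)\) at step i, as the first equality of the expansion suggests. The number of steps k is treated as fixed, so that the accumulated additive constants remain O(1).

$$\begin{aligned} K(x)&=K(p_{0})\\ &=l(f_{1}^{*})+K(p_{1})+O(1)\\ &=l(f_{1}^{*})+l(f_{2}^{*})+K(p_{2})+O(1)\\ &\;\;\vdots \\ &=\sum _{i=1}^{k}l(f_{i}^{*})+K(p_{k})+O(1)\\ &=\sum _{i=1}^{k}l(f_{i}^{*})+O(1) \end{aligned}$$

The last line uses \(p_{k}=\epsilon \), whose complexity is a constant.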
Proof (Theorem 5). Algorithmic information is defined as \(I(f_{i}^{*}:f_{j}^{*})\equiv K(f_{j}^{*})-K(f_{j}^{*}|f_{i}^{*})\). The case \(i=j\) is trivial, since \(K(f_{i}^{*}|f_{i}^{*})=0\). If \(i>j\), then \(p_{j}=\left( f_{j+1}^{*}\circ \cdots \circ f_{i}^{*}\right) (p_{i})\), which implies that all information about \(f_{i}^{*}\) is in \(p_{j}\). But since, according to Theorem 2(1), \(K(f_{j}^{*}|p_{j})=K(f_{j}^{*})+O(1)\), we conclude that \(K(f_{j}^{*}|f_{i}^{*})=K(f_{j}^{*})+O(1)\). If \(i<j\), then we know that \(f_{j}^{*}\) in no way contributed to the construction of \(p_{i}\) further in the compression process. Hence \(K(f_{j}^{*}|f_{i}^{*})=K(f_{j}^{*})\). \(\square \)
Proof (Theorem 6). Let \(p\equiv f'^{*}(x)\). Further, from Lemma 1 we know that \(K(x|p)=l(f^{*})\) and \(K(p|x)=l(f'^{*})\). Using Corollary 2, the difference in algorithmic information is \(I(p:x)-I(x:p)=K(x)-K(x|p)-K(p)+K(p|x)=l(f'^{*})+O(1)\). By [8, Lemma 3.9.2, p. 250], algorithmic information is symmetric up to logarithmic terms: \(|I(x:p)-I(p:x)|\le \log K(x)+2\log \log K(x)+\log K(p)+2\log \log K(p)+O(1)\). Since x is computed from \(f^{*}\) and p, we have \(K(p)\le K(x)\). Putting everything together leads to \(l(f'^{*})\le 2\log K(x)+4\log \log K(x)+O(1)\). The second inequality follows from \(K(x)\le l(x)+O(1)\). \(\square \)
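To illustrate how slowly the bound on \(l(f'^{*})\) grows, the following Python sketch evaluates its leading terms \(2\log K(x)+4\log \log K(x)\) for a few values of K(x). It is a numerical illustration only. The O(1) constant is dropped, base-2 logarithms are assumed, and the function name `descriptive_map_bound` and the sample values are hypothetical.

```python
import math


def descriptive_map_bound(k_x: float) -> float:
    """Leading terms 2*log2(K) + 4*log2(log2(K)) of the bound on l(f'*).

    Assumptions: base-2 logarithms; the additive O(1) constant is ignored.
    """
    if k_x <= 1:
        raise ValueError("need K(x) > 1 so that log log K(x) is defined")
    return 2 * math.log2(k_x) + 4 * math.log2(math.log2(k_x))


# Sample complexities in bits (illustrative values only).
for k_x in (2 ** 10, 2 ** 20, 2 ** 30):
    print(f"K(x) = {k_x:>10} bits -> bound ~ {descriptive_map_bound(k_x):.1f} bits")
```

Under these assumptions, a string with a complexity of about a million bits admits a descriptive map whose length is bounded by a few dozen bits plus a constant.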
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Franz, A. (2016). Some Theorems on Incremental Compression. In: Steunebrink, B., Wang, P., Goertzel, B. (eds) Artificial General Intelligence. AGI 2016. Lecture Notes in Computer Science, vol 9782. Springer, Cham. https://doi.org/10.1007/978-3-319-41649-6_8
DOI: https://doi.org/10.1007/978-3-319-41649-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41648-9
Online ISBN: 978-3-319-41649-6
eBook Packages: Computer Science (R0)