
On Hierarchical Compression and Power Laws in Nature

Part of the Lecture Notes in Computer Science book series (LNAI, volume 10414)


Since compressing data incrementally by a non-branching hierarchy has resulted in substantial efficiency gains for induction in previous work, we now explore branching hierarchical compression as a means for solving induction problems for generally intelligent systems. Even though assuming the compositionality of data generation and the locality of information may cost the universality of induction, the approach still has the potential to be general in the sense of reflecting the inherent structure of real-world data imposed by the laws of physics. We prove that branching compression hierarchies (BCHs) create power-law functions of the mutual algorithmic information between two strings as a function of their distance – a ubiquitous characteristic of natural data – which opens the possibility of efficient compression of natural data by BCHs. Further, we show that such hierarchies guarantee the existence of short features in the data, which increases the efficiency of induction even more.


Keywords

  • Hierarchical compression
  • Incremental compression
  • Algorithmic complexity
  • Universal induction
  • Power laws
  • Scale free structure



  1. For notation and definitions please consult the Preliminaries section below.

References

  1. Solomonoff, R.J.: A formal theory of inductive inference. Part I. Inf. Control 7(1), 1–22 (1964)


  2. Solomonoff, R.J.: A formal theory of inductive inference. Part II. Inf. Control 7(2), 224–254 (1964)


  3. Lin, H.W., Tegmark, M.: Why does deep and cheap learning work so well? arXiv preprint arXiv:1608.08225 (2016)

  4. Franz, A.: Some theorems on incremental compression. In: Steunebrink, B., Wang, P., Goertzel, B. (eds.) AGI -2016. LNCS, vol. 9782, pp. 74–83. Springer, Cham (2016). doi:10.1007/978-3-319-41649-6_8


  5. Bak, P.: How Nature Works: The Science of Self-organized Criticality. Copernicus, New York (1996)


  6. Saremi, S., Sejnowski, T.J.: Hierarchical model of natural images and the origin of scale invariance. Proc. Natl. Acad. Sci. 110(8), 3071–3076 (2013)


  7. Lin, H.W., Tegmark, M.: Critical behavior from deep dynamics: a hidden dimension in natural language. arXiv preprint arXiv:1606.06737 (2016)

  8. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Springer, New York (2009)


Author information

Correspondence to Arthur Franz.

A Proofs



Proof (Lemma 1). Recall that \(f_{l}\) and \(p_{l}\) are the shortest feature and parameter of \(q_{l-1}\) and therefore independent, \(K(q_{l-1})\mathop {=}\limits ^{+}l(f_{l})+K(p_{l})\), as was proven in [4, Corollary 2]. From Eq. (3.1) we obtain

$$\begin{aligned} \begin{aligned} K(q_{0})&\mathop {=}\limits ^{+}l(f_{1})+K(p_{1})\mathop {=}\limits ^{+}l(f_{1})+\alpha _{1}K(q_{1})\mathop {=}\limits ^{+}l(f_{1})+\alpha _{1}\left( l(f_{2})+\alpha _{2}K(q_{2})\right) \\&\mathop {=}\limits ^{+} K(q_{h})\prod _{l=1}^{h}\alpha _{l}+\sum _{m=1}^{h}l(f_{m})\prod _{l=1}^{m-1}\alpha _{l} \end{aligned} \end{aligned}$$

Since \(f_{l}\) and \(p_{l}\) cannot be made dependent by conditioning, we get \(K(q_{l-1}|q_{h})\mathop {=}\limits ^{+}K(f_{l}|q_{h})+K(p_{l}|q_{h})\). Due to assumption (2), the first term becomes \(K(f_{l}|q_{h})=K(f_{l})\mathop {=}\limits ^{+}l(f_{l})\). Therefore, the conditional version can be computed analogously to Eq. (A.1):

$$\begin{aligned} K(q_{0}|q_{h})\mathop {=}\limits ^{+}K(q_{h}|q_{h})\prod _{l=1}^{h}\alpha _{l}+\sum _{m=1}^{h}l(f_{m})\prod _{l=1}^{m-1}\alpha _{l} \end{aligned}$$

Since \(K(q_{h}|q_{h})=O(1)\), we obtain for the information in \(q_{h}\) about \(q_{0}\):

$$\begin{aligned} I(q_{h}:q_{0})\equiv K(q_{0})-K(q_{0}|q_{h})\mathop {=}\limits ^{+}K(q_{h})\prod _{l=1}^{h}\alpha _{l} \end{aligned}$$

   \(\square \)
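The closed form derived above can be checked numerically. The sketch below uses made-up feature lengths \(l(f_{l})\), compression ratios \(\alpha _{l}\), and top-level complexity \(K(q_{h})\) (all values hypothetical, and the additive \(\mathop {=}\limits ^{+}\) constants are ignored); it confirms that unrolling the recursion \(K(q_{l-1})=l(f_{l})+\alpha _{l}K(q_{l})\) agrees with the telescoped expression of Eq. (A.1).

```python
# Numerical sanity check of Lemma 1's closed form (toy values, additive
# constants ignored; these numbers are hypothetical, not real complexities).
l_f = [10.0, 8.0, 6.0, 5.0]      # l(f_1), ..., l(f_h)
alpha = [0.5, 0.6, 0.4, 0.7]     # alpha_1, ..., alpha_h
K_qh = 20.0                      # K(q_h), complexity of the top description
h = len(l_f)

# Unroll the recursion K(q_{l-1}) = l(f_l) + alpha_l * K(q_l) from the top down.
K = K_qh
for i in reversed(range(h)):
    K = l_f[i] + alpha[i] * K
recursive = K

# Closed form: K(q_h) * prod_l alpha_l + sum_m l(f_m) * prod_{l<m} alpha_l
prod = 1.0
closed = 0.0
for m in range(h):
    closed += l_f[m] * prod
    prod *= alpha[m]
closed += K_qh * prod

print(recursive, closed)  # the two values agree up to floating point error
```

The geometric factor \(\prod _{l}\alpha _{l}\) multiplying \(K(q_{h})\) is exactly the quantity that survives in the mutual information bound above.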


Proof (Lemma 2). We can in general expand [8, Theorem 3.9.1, p. 247]

$$\begin{aligned} K(y,z|a)\mathop {=}\limits ^{+}K(y|a)+K(z|y,K(y),a) \end{aligned}$$

and insert it into the independence relation, Eq. (3.4). This leads to

$$\begin{aligned} K(z|a)\mathop {=}\limits ^{+}K(z|y,K(y),a)\mathop {\le }\limits ^{+}K(z|y) \end{aligned}$$

where the last inequality follows from the fact that conditioning can only reduce the description length of z [8, Theorem 2.1.2, p. 108]. Subtracting this inequality from K(z) yields \(K(z)-K(z|a)\mathop {\ge }\limits ^{+}K(z)-K(z|y)\). Inserting the definition of mutual information, \(I(a:z)\equiv K(z)-K(z|a)\), on both sides, the claim follows.    \(\square \)


Proof (Theorem 1). First, from the result in Eq. (3.3) and Lemma 2 it follows that \(I(x_{i}:x_{j})\) decays exponentially with the height h of their common ancestor \(q_{h}\)

$$\begin{aligned} I(x_{i}:x_{j})\mathop {\le }\limits ^{+}K(q_{h})\cdot \prod _{l=1}^{h}\alpha _{l} \end{aligned}$$

under our assumptions. Consider that the maximal index distance between leaves in a perfect tree increases exponentially with the height h of the common ancestor:

$$\begin{aligned} d_{ij}<\prod _{l=1}^{h}\hat{b}_{l} \end{aligned}$$

where \(\hat{b}_{l}\) is the average branching factor at level l of the tree. By defining the total average branching factor \(\bar{b}\equiv \left( \prod _{l=1}^{h}\hat{b}_{l}\right) ^{1/h}>d_{ij}^{1/h}\), we can solve for \(h>\log _{\bar{b}}(d_{ij})\) and compute:

$$\begin{aligned} \log _{\bar{b}}\left( \prod _{l=1}^{h}\alpha _{l}\right) <\sum _{l=1}^{\log _{\bar{b}}(d_{ij})}\log _{\bar{b}}(\alpha _{l})=-\sum _{l=1}^{\log _{\bar{b}}(d_{ij})}\nu _{l}=-\left\langle \nu \right\rangle \log _{\bar{b}}(d_{ij})=\log _{\bar{b}}\left( d_{ij}^{-\left\langle \nu \right\rangle }\right) \end{aligned}$$

where \(\nu _{l}\equiv \log _{\bar{b}}(1/\alpha _{l})>0\). Inserting this into Eq. (A.3) concludes the proof.    \(\square \)
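The scaling step at the heart of this proof — exponential decay in the height h turning into a power law in the distance \(d_{ij}\) — can be verified numerically for the special case of a constant branching factor b and constant compression ratio \(\alpha \) (both values hypothetical):

```python
import math

# Sketch of Theorem 1's scaling argument with constant b and alpha.
b, alpha, K_qh = 2.0, 0.6, 50.0
nu = math.log(1.0 / alpha, b)  # nu = log_b(1/alpha) > 0

for h in range(1, 11):
    d = b ** h                        # maximal leaf distance under a height-h ancestor
    bound_exp = K_qh * alpha ** h     # exponential bound in the height h
    bound_pow = K_qh * d ** (-nu)     # the same bound, as a power law in d
    assert abs(bound_exp - bound_pow) < 1e-9

print("exponential decay in h equals the power law d^(-nu) in the distance d")
```

In this constant-ratio case the identity \(\alpha ^{h}=(b^{h})^{-\nu }\) is exact; the proof's averaging over levels generalizes it to level-dependent \(\hat{b}_{l}\) and \(\alpha _{l}\).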


Proof (Lemma 3). Consider the general expansion [8, Theorem 3.9.1, p. 247]

$$\begin{aligned} K(xy)\mathop {=}\limits ^{+}K(x)+K(y|x,K(x)) \end{aligned}$$

The mutual information is defined by \(I(x:y)\equiv K(y)-K(y|x)\) and is larger than zero by assumption. Since in general \(K(y|x,K(x))\mathop {\le }\limits ^{+}K(y|x)\), we obtain

$$\begin{aligned} \begin{aligned} K(xy)&\mathop {=}\limits ^{+}K(x)+K(y)+K(y|x,K(x))-K(y|x)-I(x:y)\\&\mathop {<}\limits ^{+}K(x)+K(y)\mathop {\le }\limits ^{+}l(x)+l(y)=l(xy) \end{aligned} \end{aligned}$$

   \(\square \)


Proof (Theorem 2). Since y is \(l(\lambda )\)-compressible by q, \(\lambda (q,p)=U\left( \left\langle \lambda ,q,p\right\rangle \right) =x\) and \(l(x)=l(y)+l(p)\), x is compressible as well:

$$\begin{aligned} K(x)\le l(\lambda )+l(q)+l(p)=l(\lambda )+K(y)+l(x)-l(y)<l(x) \end{aligned}$$

We define \(f\equiv \left\langle \lambda ,q\right\rangle \) and obtain \(U(\left\langle f,p\right\rangle )=f(p)=x\) – the main feature equation. We can define the descriptive map \(f'\) by a function that removes y from x to obtain the remainder p: \(f'(x)=p\). It suffices if it does so for that particular x and y, not in general.

From the definition of f we get \(l(f)=l(\lambda )+l(q)=l(\lambda )+K(y)<l(y)\), since y is \(l(\lambda )\)-compressible by assumption. It follows that the \((f,p)\)-pair compresses x at least to some extent, \(l(f)+l(p)<l(y)+l(p)=l(x)\). Therefore, f is indeed a feature of x and its length is bounded by l(y).    \(\square \)
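The construction in this proof can be made concrete with a toy stand-in for the universal machine: below, zlib plays the role of the short program q for the compressible part y, and \(\lambda \) simply decompresses q and appends the remainder p. This is only an illustrative sketch — zlib code lengths are not Kolmogorov complexities, and the pairing \(\left\langle \cdot ,\cdot \right\rangle \) is left implicit.

```python
import zlib

# Toy instantiation of the feature construction in Theorem 2 (a sketch;
# zlib stands in for the universal machine U, so all lengths are merely
# illustrative, not actual complexities).
y = b"ab" * 50                     # a highly compressible prefix of x
q = zlib.compress(y)               # stands in for a short program computing y
p = b"some incompressible tail"    # the parameter / remainder
lam = lambda q, p: zlib.decompress(q) + p

x = lam(q, p)                      # f(p) = x with f = <lambda, q>
assert x == y + p                  # the main feature equation holds
assert len(q) < len(y)             # f is short: l(f) ~ l(q) < l(y)
print(len(q), len(y), len(p))
```

The descriptive map \(f'\) of the proof corresponds here to chopping the known prefix y off x to recover p.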


Proof (Theorem 3). In general, the relation \(K(p)\mathop {\le }\limits ^{+}K(p|z)+K(z)\) holds, since if p is computable by a detour via z, its shortest program without the detour can only be shorter. Setting \(z=K(x)\) and conditioning on x leads to

$$\begin{aligned} K(p|x)\mathop {\le }\limits ^{+}K(p|K(x),x)+K(K(x)|x) \end{aligned}$$

Conditioning this inequality is not valid in general; however, the detour argument still applies in this case. Since \(K(p|x)=l(f')\) [4, Lemma 1(2)] and \(K(p|K(x),x)=O(1)\) [4, Theorem 3(3)], we get

$$\begin{aligned} l(f')\mathop {\le }\limits ^{+}K(K(x)|x) \end{aligned}$$

We now insert the “complexity of the complexity” bound [8, Lemma 3.9.2, Eq. (3.18)], \(K(K(x)|x)\mathop {\le }\limits ^{+}\log K(x)+2\log \log K(x)\), and the first claim follows. The second claim is a property of K(K(x)|x) [8, Eq. (3.13)] and therefore also holds for \(l(f')\).    \(\square \)
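To get a feel for how small the resulting bound on \(l(f')\) is, the sketch below evaluates \(\log K(x)+2\log \log K(x)\) for a few hypothetical complexity values (base-2 logarithms, additive constants ignored):

```python
import math

# Theorem 3's bound: l(f') <~ log K(x) + 2 log log K(x), so even for very
# complex strings the descriptive map stays tiny. The K(x) values below
# are hypothetical, chosen only to illustrate the growth rate.
for K_x in [100, 10_000, 1_000_000]:
    bound = math.log2(K_x) + 2 * math.log2(math.log2(K_x))
    print(f"K(x) = {K_x:>9}: l(f') <~ {bound:.1f} bits")
```

The bound grows only logarithmically in K(x), which is what makes the search for features tractable.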


Copyright information

© 2017 Springer International Publishing AG

Cite this paper

Franz, A. (2017). On Hierarchical Compression and Power Laws in Nature. In: Everitt, T., Goertzel, B., Potapov, A. (eds.) Artificial General Intelligence. AGI 2017. Lecture Notes in Computer Science, vol. 10414. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63702-0

  • Online ISBN: 978-3-319-63703-7
