
Using Ideas of Kolmogorov Complexity for Studying Biological Texts


Abstract

Kolmogorov complexity furnishes many useful tools for studying different natural processes that can be expressed using sequences of symbols from a finite alphabet (texts), such as genetic texts, literary and music texts, animal communications, etc. Although Kolmogorov complexity is not algorithmically computable, in a certain sense it can be estimated by means of data compressors. Here we suggest a method of analysis of sequences based on ideas of Kolmogorov complexity and mathematical statistics, and apply this method to biological (ethological) “texts.” What distinguishes the suggested method from other Kolmogorov complexity-based approaches to the analysis of sequential data is that it belongs to the framework of mathematical statistics, more specifically, that of hypothesis testing. This makes it a promising candidate for being included in the toolbox of standard biological methods of analysis of different natural texts, from DNA sequences to animal behavioural patterns (ethological “texts”). Two examples of analysis of ethological texts are considered in this paper. These examples show that the proposed method is a useful tool for distinguishing between stereotyped and flexible behaviours, which is important for behavioural and evolutionary studies.


References

  1. Anel, C., Sanderson, M.J.: Missing the forest for the trees: phylogenetic compression and its implications for inferring complex evolutionary histories. Syst. Biol. 54(1), 146–157 (2005)

  2. Billingsley, P.: Ergodic Theory and Information. Wiley, New York (1965)

  3. Cilibrasi, R., Vitanyi, P.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005)

  4. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, New York (2006)

  5. Ferragina, P., Giancarlo, R., Greco, V., Manzini, G., Valiente, G.: Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. BMC Bioinf. 8 (2007)

  6. Fisher, R.A.: Statistical Methods, Experimental Design, and Scientific Inference. Oliver & Boyd, Edinburgh (1956)

  7. Gallager, R.G.: Information Theory and Reliable Communication. Wiley, New York (1968)

  8. Groothuis, T.: The influence of social experience on the development and fixation of the form of displays in the black-headed gull. Anim. Behav. 43(1), 1–14 (1992)

  9. Hutter, M.: Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, Berlin (2005)

  10. Kendall, M.G., Stuart, A.: The Advanced Theory of Statistics, vol. 2: Inference and Relationship. Griffin, London (1961)

  11. KGB archiver (v. 1.2). http://www.softpedia.com/get/Compression-tools/KGB-Archiver.shtml

  12. Kieffer, J., Yang, E.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46, 737–754 (2000)

  13. Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, New York (1997)

  14. Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., Zhang, H.Y.: An information-based distance and its application to whole mitochondrial genome phylogeny. Bioinformatics (Oxford) 17, 149–154 (2001)

  15. Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. IEEE Trans. Inf. Theory 50(12), 3250–3264 (2004)

  16. Lloyd, E. (ed.): Handbook of Applied Statistics, vol. 2. Wiley-Interscience, New York (1984)

  17. McCowan, B., Doyle, L.R., Hanser, S.F.: Using information theory to assess the diversity, complexity, and development of communicative repertoires. J. Comp. Psychol. 116(2), 166–172 (2002)

  18. Oller, D.K., Griebel, U. (eds.): Evolution of Communicative Flexibility: Complexity, Creativity, and Adaptability in Human and Animal Communication. MIT Press, Cambridge (2008)

  19. Panteleeva, S., Danzanov, Zh., Reznikova, Zh.: Estimate of complexity of behavioral patterns in ants: analysis of hunting behavior in Myrmica rubra (Hymenoptera, Formicidae) as an example. Entomol. Rev. 91(2), 221–230 (2011)

  20. Reznikova, Z.: Animal Intelligence: From Individual to Social Cognition. Cambridge University Press, Cambridge (2007)

  21. Reznikova, Zh., Panteleeva, S.: An ant’s eye view of culture: propagation of new traditions through triggering dormant behavioural patterns. Acta Ethol. 11, 73–80 (2008)

  22. Rissanen, J.: Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory 30(4), 629–636 (1984)

  23. Ryabko, B.: Prediction of random sequences and universal coding. Probl. Inf. Transm. 24(2), 87–96 (1988)

  24. Ryabko, B., Reznikova, Zh.: Using Shannon entropy and Kolmogorov complexity to study the communicative system and cognitive capacities in ants. Complexity 2, 37–42 (1996)

  25. Ryabko, B., Reznikova, Z.: The use of ideas of information theory for studying “language” and intelligence in ants. Entropy 11, 836–853 (2009)

  26. Ryabko, D., Schmidhuber, J.: Using data compressors to construct order tests for homogeneity and component independence. Appl. Math. Lett. 22(7), 1029–1032 (2009)

  27. Ryabko, B., Astola, J., Gammerman, A.: Application of Kolmogorov complexity and universal codes to identity testing and nonparametric testing of serial independence for time series. Theor. Comput. Sci. 359, 440–448 (2006)

  28. Tinbergen, N.: An objective study of the innate behaviour of animals. Bibl. Biotheor. 1, 39–98 (1942)

  29. Tinbergen, N.: The Study of Instinct. Oxford University Press, London (1951)

  30. Vitanyi, P.M.B.: Information distance in multiples. IEEE Trans. Inf. Theory 57(4), 2451–2456 (2011)

  31. Yaglom, A.M., Yaglom, I.M.: Probability and Information. Theory and Decision Library. Springer, Berlin (1983)

  32. Zvonkin, A.K., Levin, L.A.: The complexity of finite objects and concepts of information and randomness through the algorithm theory. Russ. Math. Surv. 25(6), 83–124 (1970)


Author information


Corresponding author

Correspondence to Boris Ryabko.

Additional information

Research was supported by the Russian Foundation for Basic Research (grants 12-07-00125 and 11-04-00536), by the Integrated Project of the Siberian Branch of the RAS (grant No. 21), by the Program “Living Nature” of the Presidium of the Russian Academy of Sciences (grant No. 30.6), and by the Program of cooperative investigations of SB RAS and third parties (grant No. 63).

Appendices

Appendix 1: Universal Codes, Shannon Entropy and Kolmogorov Complexity

First we briefly describe stochastic processes (or sources of information). Consider a finite alphabet A, and denote by \(A^t\) and \(A^{*}\) the set of all words of length t over A and the set of all finite words over A, respectively (\(A^{*} = \bigcup_{i=1}^{\infty}A^{i}\)).

A process P is called stationary if

$$P(x_1,\dots,x_k=a_1,\dots,a_k)=P(x_{t+1},\dots,x_{t+k}=a_1,\dots,a_k) $$

for all \(t, k \in \mathbb{N}\) and all \((a_1,\dots,a_k) \in A^k\). A stationary process is called stationary ergodic if the frequency of occurrence of every word \(a_1,\dots,a_k\) converges (a.s.) to \(P(a_1,\dots,a_k)\). For more details see [2, 4, 7].

Let τ be a stationary ergodic source generating letters from a finite alphabet A. The m-order (conditional) Shannon entropy and the limit Shannon entropy are defined as follows:

$$ h_m(\tau) = - \sum_{v \in A^m} \tau(v) \sum_{a \in A} \tau(a|\,v) \log \tau(a|\,v),\qquad h_\infty(\tau) = \lim_{m \rightarrow \infty} h_m(\tau), $$
(6)

[2, 7]. The well-known Shannon-McMillan-Breiman theorem states that

$$ \lim_{t\rightarrow\infty} - \log \tau(x_1 \ldots x_t) /t = h_\infty(\tau) $$
(7)

with probability 1, see [2, 4, 7].
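
To make the definition (6) concrete, the following is a minimal sketch (not part of the original paper) of a plug-in estimate of the m-order conditional entropy from a single sequence: empirical frequencies of m-grams and (m+1)-grams replace the true probabilities τ(v) and τ(a|v). The function name and the choice of Python are illustrative assumptions only.

```python
from collections import Counter
from math import log2

def empirical_conditional_entropy(text, m):
    """Plug-in estimate of the m-order conditional Shannon entropy h_m(tau)
    from a single sequence; empirical frequencies of (m+1)-grams and m-grams
    stand in for the probabilities tau(v) and tau(a|v) in (6)."""
    if len(text) <= m:
        raise ValueError("sequence too short for the chosen order m")
    positions = range(len(text) - m)
    context_counts = Counter(text[i:i + m] for i in positions)    # counts of contexts v
    block_counts = Counter(text[i:i + m + 1] for i in positions)  # counts of blocks va
    total = sum(block_counts.values())
    h = 0.0
    for block, count in block_counts.items():
        p_block = count / total                      # estimate of tau(v) * tau(a|v)
        p_cond = count / context_counts[block[:m]]   # estimate of tau(a|v)
        h -= p_block * log2(p_cond)
    return h

# A purely periodic "text" has zero first-order conditional entropy,
# while an irregular one over the same alphabet does not.
print(empirical_conditional_entropy("abababababababab", 1))  # 0.0
print(empirical_conditional_entropy("abbabaabbaababba", 1))  # > 0
```

Such plug-in estimates are only reliable when the sequence is long compared with the number of possible m-grams; as m grows, h_m approaches the limit entropy h_∞(τ) of (6).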

Now we define codes and Kolmogorov complexity. Let \(A^{\infty}\) be the set of all infinite words \(x_1 x_2\ldots\) over the alphabet A. A data compression method (or code) φ is defined as a set of mappings \(\varphi_n\) such that \(\varphi_n : A^n \rightarrow \{0,1\}^{*}\), n=1,2,…, and for each pair of different words \(x, y \in A^n\) we have \(\varphi_n(x) \neq \varphi_n(y)\). Informally, this means that the code φ can be applied for compression of each message of any length n over the alphabet A and the message can be decoded if its code is known. It is also required that each sequence \(\varphi_n(u_1)\varphi_n(u_2)\ldots\varphi_n(u_r)\), r≥1, of encoded words from the set \(A^n\), n≥1, can be uniquely decoded into \(u_1 u_2 \ldots u_r\). Such codes are called uniquely decodable. For example, let A={a,b}; the code \(\psi_1(a)=0\), \(\psi_1(b)=00\) is obviously not uniquely decodable. It is well known that if a code φ is uniquely decodable then the lengths of the codewords satisfy the following inequality (the Kraft inequality): \(\Sigma_{u \in A^{n}}\: 2^{- |\varphi_{n} (u) |} \leq 1\); see, e.g., [4, 7].
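
As a small illustration of the Kraft inequality (added here, not taken from the original text), the snippet below evaluates its left-hand side for a given set of binary codeword lengths. Note that the code ψ_1 from the example above satisfies the inequality even though it is not uniquely decodable, so the inequality is a necessary condition only.

```python
def kraft_sum(codeword_lengths):
    """Left-hand side of the Kraft inequality for binary codewords:
    the sum of 2**(-length). Every uniquely decodable code gives a sum <= 1."""
    return sum(2.0 ** (-length) for length in codeword_lengths)

# The code psi_1(a) = 0, psi_1(b) = 00 has lengths 1 and 2: the sum is
# 0.75 <= 1, yet the code is not uniquely decodable, so satisfying the
# inequality does not by itself certify unique decodability.
print(kraft_sum([1, 2]))     # 0.75

# A prefix (hence uniquely decodable) code a -> 0, b -> 10, c -> 11:
print(kraft_sum([1, 2, 2]))  # 1.0
```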

In this paper we will use the so-called prefix Kolmogorov complexity, whose precise definition can be found in [9, 13]. Its main properties can be described as follows. There exists a uniquely decodable code κ such that (i) there is an algorithm for decoding (i.e. there is a Turing machine which maps κ(u) to u for every \(u \in A^{*}\)) and (ii) for any uniquely decodable code ψ, whose decoding is algorithmically realizable, there exists a non-negative constant \(C_{\psi}\) such that

$$ \bigl|\kappa(u)\bigr| - \bigl|\psi(u)\bigr| < C_{\psi} $$
(8)

for every \(u \in A^{*}\); see Theorem 3.1.1 in [13]. The prefix Kolmogorov complexity K(u) is defined as the length of κ(u): K(u)=|κ(u)|. The code κ is not unique, but the second property means that the codelengths of any two codes \(\kappa_1\) and \(\kappa_2\) for which (i) and (ii) hold are equal up to a constant: \(\bigl||\kappa_1(u)|-|\kappa_2(u)|\bigr| < C_{1,2}\) for any word u (and the constant \(C_{1,2}\) does not depend on u, see (8)). So, K(u) is defined up to a constant. In what follows we call this value “Kolmogorov complexity”.

We can see from (ii) that the code κ is asymptotically (up to a constant) the best method of data compression, but it turns out that there is no algorithm that can calculate the codeword κ(u) (and even K(u)). That is why the code κ (and Kolmogorov complexity) cannot be used for practical data compression directly.

The following Claim is by Levin [32, Proposition 5.1]:

Claim

For any stationary ergodic source τ

$$ \lim_{t \rightarrow \infty} t^{-1} K(x_1 \ldots x_t) = h_\infty(\tau) $$
(9)

with probability 1.

Comment

In [32] this claim is formulated for “common” Kolmogorov complexity, but it is also valid for the prefix Kolmogorov complexity, because for any word \(x_1 \ldots x_t\) the difference between both complexities is O(log t), see [13].

Let us describe universal codes, or data compressors. For their description we recall that (as is known in Information Theory) sequences \(x_1 \ldots x_t\) generated by a source p can be “compressed” down to the length \(-\log p(x_1 \ldots x_t)\) bits and, on the other hand, there is no code ψ for which the expected codeword length \(\Sigma_{x_1 \ldots x_t \in A^t} p(x_1 \ldots x_t)\, |\psi(x_1 \ldots x_t)|\) is less than \(- \Sigma_{x_1 \ldots x_t \in A^t} p(x_1 \ldots x_t) \log p(x_1 \ldots x_t)\). Universal codes reach the lower bound \(-\log p(x_1 \ldots x_t)\) asymptotically for any stationary ergodic source p with probability 1. The formal definition is as follows: a code φ is universal if for any stationary ergodic source p

$$ \lim_{t \rightarrow \infty} t^{-1} \bigl(- \log p(x_1 \ldots x_t) - \bigl|\varphi(x_1 \ldots x_t)\bigr| \bigr) = 0 $$
(10)

with probability 1. So, informally speaking, universal codes estimate the probability characteristics of the source p and use them for efficient “compression.”
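
In practice, off-the-shelf compressors play the role of such universal codes: the per-letter length of the compressed file is a computable stand-in for the complexity rate in (9). The following sketch (an illustration added here, not the authors' exact procedure) uses the standard Python bz2 and zlib modules merely as examples of a data compressor φ; the paper itself relies on other archivers (e.g., the KGB archiver [11]), and the constant overhead on short sequences should be kept in mind.

```python
import bz2
import random
import zlib

def compressed_bits_per_letter(text, method="bz2"):
    """Per-letter length (in bits) of the compressed text, used as a rough
    estimate of its complexity rate. Real compressors only approximate a
    universal code, and short texts carry a noticeable constant overhead,
    so values are best compared between texts of similar length."""
    data = text.encode("ascii")
    packed = bz2.compress(data) if method == "bz2" else zlib.compress(data, 9)
    return 8 * len(packed) / len(text)

# A rigid, stereotyped sequence compresses much better than an irregular
# sequence over the same alphabet and of the same length.
stereotyped = "abcabcabc" * 100
flexible = "".join(random.choices("abc", k=len(stereotyped)))
print(compressed_bits_per_letter(stereotyped))  # small
print(compressed_bits_per_letter(flexible))     # near log2(3), plus overhead
```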

Appendix 2: Proofs of Theorems

Proof of Theorem 1

For any universal code from the Shannon-McMillan-Breiman theorem (7) and the definition (10) we obtain the following equation

$$ \lim_{t \rightarrow \infty} t^{-1} \bigl|\varphi(x_1 \ldots x_t)\bigr| = h_\infty(p) , $$
(11)

with probability 1. Taking this equality and (9) into account, we obtain the statement of Theorem 1. □

Proof of Theorem 2

For any t the level of significance of the test T equals α, so, by definition, the Type I error of the test \(T^{'}_{\varphi}\) equals α, too. In order to prove the second statement of the theorem we suppose that the hypothesis \(H_1\) is true. From (9) we can see that there exist constants \(k_1\) and \(k_2\) such that with probability 1

$$ \lim_{t \rightarrow \infty} t^{-1} K(x_1 \ldots x_t) = k_i , $$
(12)

where \(x_1 \ldots x_t \in S_i\), i=1,2, and \(k_1 \neq k_2\). Let us suppose that \(k_1 > k_2\) and define \(\Delta = k_1 - k_2\). From (1), (12) and Theorem 1 we can see that

$$ \lim_{t \rightarrow \infty} K_\varphi(x_1 \ldots x_t) = k_i $$
(13)

with probability 1 (here \(x_1 \ldots x_t \in S_i\), i=1,2). By definition, this means that for any ϵ>0 and δ>0 there exists t′ such that

$$P \bigl\{ \bigl| K_\varphi(x_1 \ldots x_t) - k_i \bigr| < \epsilon \bigr\} \ge 1-\delta $$

for \(x_1 \ldots x_t \in S_i\), i=1,2, when t>t′. Hence, if ϵ=Δ/4, then with probability at least 1−δ all values \(K_\varphi(x_1 \ldots x_t)\), \(x_1 \ldots x_t \in S_1\), are greater than all values \(K_\varphi(x_1 \ldots x_t)\), \(x_1 \ldots x_t \in S_2\). So, with probability at least 1−δ the set of ranked values will look like the right part of (4) and, hence, the hypothesis \(H_0\) will be rejected (for large enough \(|S_i|\), i=1,2). Taking into account that \(\min(|S_1|,|S_2|) \rightarrow \infty\) and t→∞, we can see that the last statement is valid for any δ. The theorem is proved. □
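
The tests T and \(T^{'}_{\varphi}\) themselves are defined in the main text (equations (1)-(5), not reproduced in this appendix). Purely as an illustration of the ranking argument above, the following sketch applies a Mann-Whitney-style rank test to per-letter compressed lengths of two groups of ethological “texts”; the particular compressor (bz2), the SciPy rank test, and all function names are assumptions of this sketch, not the authors' exact procedure.

```python
import bz2
from scipy.stats import mannwhitneyu

def k_phi(sequence):
    """Per-letter compressed length (in bits) of a behavioural 'text',
    a computable estimate of its complexity."""
    data = sequence.encode("ascii")
    return 8 * len(bz2.compress(data)) / len(sequence)

def compare_groups(group_1, group_2, alpha=0.05):
    """Compress every sequence in both groups, rank the per-letter
    codelengths, and apply a two-sided Mann-Whitney U test. A small
    p-value is evidence that the two groups differ in complexity
    (e.g., stereotyped versus flexible behaviour)."""
    scores_1 = [k_phi(s) for s in group_1]
    scores_2 = [k_phi(s) for s in group_2]
    statistic, p_value = mannwhitneyu(scores_1, scores_2, alternative="two-sided")
    return p_value, p_value < alpha
```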

Appendix 3: Dictionary of Gull’s Behaviours

Symbol | Gull’s position | Demonstrative postures and actions | Position of wings | Vocalisation
------ | --------------- | ---------------------------------- | ----------------- | ------------
0 | sitting on eggs | upright | folding | aggressive call
1 | sitting on eggs | upright | folding | no
2 | sitting on eggs | oblique | folding | aggressive call
3 | sitting on eggs | oblique | folding | long call
4 | sitting on eggs | oblique | folding | no
5 | sitting on eggs | biting with a beak | folding | no
6 | sitting on eggs | snapping with a beak | folding | no
7 | sitting on eggs | no | folding | aggressive call
7 | sitting on eggs | no | folding | long call
8 | sitting on eggs | no | folding | no
9 | standing on a nest | upright | folding | aggressive call
a | standing on a nest | upright | folding | no
b | standing on a nest | oblique | stretching | aggressive call
c | standing on a nest | oblique | stretching | long call
d | standing on a nest | oblique | stretching | no
e | standing on a nest | oblique | folding | aggressive call
f | standing on a nest | oblique | folding | long call
g | standing on a nest | oblique | folding | no
h | standing on a nest | oblique | flapping | aggressive call
i | standing on a nest | oblique | flapping | long call
j | standing on a nest | oblique | flapping | no
k | standing on a nest | biting with a beak | stretching | no
l | standing on a nest | biting with a beak | folding | no
m | standing on a nest | biting with a beak | flapping | no
n | standing on a nest | snapping with a beak | stretching | no
o | standing on a nest | snapping with a beak | folding | no
p | standing on a nest | snapping with a beak | flapping | no
q | standing on a nest | slashing with a wing | flapping | aggressive call
r | standing on a nest | slashing with a wing | flapping | long call
s | standing on a nest | slashing with a wing | flapping | no
t | standing on a nest | no | stretching | aggressive call
u | standing on a nest | no | stretching | no
v | standing on a nest | no | folding | aggressive call
w | standing on a nest | no | folding | no
x | standing on a nest | no | flapping | aggressive call
y | standing on a nest | no | flapping | no
z | flying | no | flapping | aggressive call
A | flying | no | flapping | no
B | flying | swooping to attack | flapping | long call
C | flying | swooping to attack | flapping | aggressive call
D | flying | swooping to attack | flapping | no
E | flying | biting with a beak | flapping | no
F | flying | snapping with a beak | flapping | no
G | sitting on a perch | oblique | stretching | aggressive call
H | sitting on a perch | oblique | stretching | long call
I | sitting on a perch | oblique | stretching | no
J | sitting on a perch | oblique | folding | aggressive call
K | sitting on a perch | oblique | folding | long call


Cite this article

Ryabko, B., Reznikova, Z., Druzyaka, A. et al. Using Ideas of Kolmogorov Complexity for Studying Biological Texts. Theory Comput Syst 52, 133–147 (2013). https://doi.org/10.1007/s00224-012-9403-6
