
Using Ideas of Kolmogorov Complexity for Studying Biological Texts


Abstract

Kolmogorov complexity furnishes many useful tools for studying different natural processes that can be expressed using sequences of symbols from a finite alphabet (texts), such as genetic texts, literary and music texts, animal communications, etc. Although Kolmogorov complexity is not algorithmically computable, in a certain sense it can be estimated by means of data compressors. Here we suggest a method of analysis of sequences based on ideas of Kolmogorov complexity and mathematical statistics, and apply this method to biological (ethological) “texts.” What distinguishes the suggested method from other Kolmogorov complexity-based approaches to the analysis of sequential data is that it belongs to the framework of mathematical statistics, more specifically, that of hypothesis testing. This makes it a promising candidate for being included in the toolbox of standard biological methods of analysis of different natural texts, from DNA sequences to animal behavioural patterns (ethological “texts”). Two examples of analysis of ethological texts are considered in this paper. These examples show that the proposed method is a useful tool for distinguishing between stereotyped and flexible behaviours, which is important for behavioural and evolutionary studies.


References

  1. Anel, C., Sanderson, M.J.: Missing the forest for the trees: phylogenetic compression and its implications for inferring complex evolutionary histories. Syst. Biol. 54(1), 146–157 (2005)

  2. Billingsley, P.: Ergodic Theory and Information. Wiley, New York (1965)

  3. Cilibrasi, R., Vitanyi, P.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005)

  4. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, New York (2006)

  5. Ferragina, P., Giancarlo, R., Greco, V., Manzini, G., Valiente, G.: Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. BMC Bioinf. 8 (2007)

  6. Fisher, R.A.: Statistical Methods, Experimental Design, and Scientific Inference. Oliver & Boyd, Edinburgh (1956)

  7. Gallager, R.G.: Information Theory and Reliable Communication. Wiley, New York (1968)

  8. Groothuis, T.: The influence of social experience on the development and fixation of the form of displays in the black-headed gull. Anim. Behav. 43(1), 1–14 (1992)

  9. Hutter, M.: Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, Berlin (2005)

  10. Kendall, M.G., Stuart, A.: The Advanced Theory of Statistics, vol. 2: Inference and Relationship. Griffin, London (1961)

  11. KGB archiver (v. 1.2). http://www.softpedia.com/get/Compression-tools/KGB-Archiver.shtml

  12. Kieffer, J., Yang, E.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46, 737–754 (2000)

  13. Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, New York (1997)

  14. Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., Zhang, H.Y.: An information-based distance and its application to whole mitochondrial genome phylogeny. Bioinformatics (Oxford) 17, 149–154 (2001)

  15. Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. IEEE Trans. Inf. Theory 50(12), 3250–3264 (2004)

  16. Lloyd, E. (ed.): Handbook of Applied Statistics, vol. 2. Wiley-Interscience, New York (1984)

  17. McCowan, B., Doyle, L.R., Hanser, S.F.: Using information theory to assess the diversity, complexity, and development of communicative repertoires. J. Comp. Psychol. 116(2), 166–172 (2002)

  18. Oller, D.K., Griebel, U. (eds.): Evolution of Communicative Flexibility: Complexity, Creativity, and Adaptability in Human and Animal Communication. MIT Press, Cambridge (2008)

  19. Panteleeva, S., Danzanov, Zh., Reznikova, Zh.: Estimate of complexity of behavioral patterns in ants: analysis of hunting behavior in Myrmica rubra (Hymenoptera, Formicidae) as an example. Entomol. Rev. 91(2), 221–230 (2011)

  20. Reznikova, Z.: Animal Intelligence: From Individual to Social Cognition. Cambridge University Press, Cambridge (2007)

  21. Reznikova, Zh., Panteleeva, S.: An ant’s eye view of culture: propagation of new traditions through triggering dormant behavioural patterns. Acta Ethol. 11, 73–80 (2008)

  22. Rissanen, J.: Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory 30(4), 629–636 (1984)

  23. Ryabko, B.: Prediction of random sequences and universal coding. Probl. Inf. Transm. 24(2), 87–96 (1988)

  24. Ryabko, B., Reznikova, Zh.: Using Shannon entropy and Kolmogorov complexity to study the communicative system and cognitive capacities in ants. Complexity 2, 37–42 (1996)

  25. Ryabko, B., Reznikova, Z.: The use of ideas of information theory for studying “language” and intelligence in ants. Entropy 11, 836–853 (2009)

  26. Ryabko, D., Schmidhuber, J.: Using data compressors to construct order tests for homogeneity and component independence. Appl. Math. Lett. 22(7), 1029–1032 (2009)

  27. Ryabko, B., Astola, J., Gammerman, A.: Application of Kolmogorov complexity and universal codes to identity testing and nonparametric testing of serial independence for time series. Theor. Comput. Sci. 359, 440–448 (2006)

  28. Tinbergen, N.: An objective study of the innate behaviour of animals. Bibl. Biotheor. 1, 39–98 (1942)

  29. Tinbergen, N.: The Study of Instinct. Oxford University Press, London (1951)

  30. Vitanyi, P.M.B.: Information distance in multiples. IEEE Trans. Inf. Theory 57(4), 2451–2456 (2011)

  31. Yaglom, A.M., Yaglom, I.M.: Probability and Information. Theory and Decision Library. Springer, Berlin (1983)

  32. Zvonkin, A.K., Levin, L.A.: The complexity of finite objects and concepts of information and randomness through the algorithm theory. Russ. Math. Surv. 25(6), 83–124 (1970)


Author information


Corresponding author

Correspondence to Boris Ryabko.

Additional information

Research was supported by the Russian Foundation for Basic Research (grants 12-07-00125 and 11-04-00536), by the Integrated Project of the Siberian Branch of the RAS (grant No. 21), by the Program “Living Nature” of the Presidium of the Russian Academy of Sciences (grant No. 30.6), and by the Program of cooperative investigations of SB RAS and third parties (grant No. 63).

Appendices

Appendix 1: Universal Codes, Shannon Entropy and Kolmogorov Complexity

First we briefly describe stochastic processes (or sources of information). Consider a finite alphabet A, and denote by \(A^t\) and \(A^{*}\) the set of all words of length t over A and the set of all finite words over A, respectively (\(A^{*} = \bigcup_{i=1}^{\infty}A^{i}\)).

A process P is called stationary if

$$P(x_1,\dots,x_k=a_1,\dots,a_k)=P(x_{t+1},\dots,x_{t+k}=a_1,\dots,a_k) $$

for all \(t, k \in \mathbb{N}\) and all \((a_1,\dots,a_k) \in A^k\). A stationary process is called stationary ergodic if the frequency of occurrence of every word \(a_1,\dots,a_k\) converges (a.s.) to \(P(a_1,\dots,a_k)\). For more details see [2, 4, 7].

Let τ be a stationary ergodic source generating letters from a finite alphabet A. The m-order (conditional) Shannon entropy and the limit Shannon entropy are defined as follows:

$$ h_m(\tau) = - \sum_{v \in A^m} \tau(v) \sum_{a \in A} \tau(a|\,v) \log \tau(a|\,v),\qquad h_\infty(\tau) = \lim_{m \rightarrow \infty} h_m(\tau), $$
(6)

[2, 7]. The well-known Shannon-McMillan-Breiman theorem states that

$$ \lim_{t\rightarrow\infty} - \log \tau(x_1 \ldots x_t) /t = h_\infty(\tau) $$
(7)

with probability 1, see [2, 4, 7].
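
To make the definition (6) concrete, the following is a minimal sketch (not part of the original paper) of a plug-in estimate of the m-order conditional entropy from a single sequence: empirical frequencies of m-grams and (m+1)-grams replace the true probabilities τ(v) and τ(a|v). The function name and the choice of Python are illustrative assumptions only.

```python
from collections import Counter
from math import log2

def empirical_conditional_entropy(text, m):
    """Plug-in estimate of the m-order conditional Shannon entropy h_m(tau)
    from a single sequence; empirical frequencies of (m+1)-grams and m-grams
    stand in for the probabilities tau(v) and tau(a|v) in (6)."""
    if len(text) <= m:
        raise ValueError("sequence too short for the chosen order m")
    positions = range(len(text) - m)
    context_counts = Counter(text[i:i + m] for i in positions)    # counts of contexts v
    block_counts = Counter(text[i:i + m + 1] for i in positions)  # counts of blocks va
    total = sum(block_counts.values())
    h = 0.0
    for block, count in block_counts.items():
        p_block = count / total                      # estimate of tau(v) * tau(a|v)
        p_cond = count / context_counts[block[:m]]   # estimate of tau(a|v)
        h -= p_block * log2(p_cond)
    return h

# A purely periodic "text" has zero first-order conditional entropy,
# while an irregular one over the same alphabet does not.
print(empirical_conditional_entropy("abababababababab", 1))  # 0.0
print(empirical_conditional_entropy("abbabaabbaababba", 1))  # > 0
```

Such plug-in estimates are only reliable when the sequence is long compared with the number of possible m-grams; as m grows, h_m approaches the limit entropy h_∞(τ) of (6).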

Now we define codes and Kolmogorov complexity. Let \(A^{\infty}\) be the set of all infinite words \(x_1 x_2\ldots\) over the alphabet A. A data compression method (or code) φ is defined as a set of mappings \(\varphi_n\) such that \(\varphi_n : A^n \rightarrow \{0,1\}^{*}\), n=1,2,…, and for each pair of different words \(x, y \in A^n\) we have \(\varphi_n(x) \neq \varphi_n(y)\). Informally, this means that the code φ can be applied for compression of each message of any length n over the alphabet A and the message can be decoded if its code is known. It is also required that each sequence \(\varphi_n(u_1)\varphi_n(u_2)\ldots\varphi_n(u_r)\), r≥1, of encoded words from the set \(A^n\), n≥1, can be uniquely decoded into \(u_1 u_2 \ldots u_r\). Such codes are called uniquely decodable. For example, let A={a,b}; the code \(\psi_1(a)=0\), \(\psi_1(b)=00\) is obviously not uniquely decodable. It is well known that if a code φ is uniquely decodable then the lengths of the codewords satisfy the following inequality (the Kraft inequality): \(\Sigma_{u \in A^{n}}\: 2^{- |\varphi_{n} (u) |} \leq 1\); see, e.g., [4, 7].
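
As a small illustration of the Kraft inequality (added here, not taken from the original text), the snippet below evaluates its left-hand side for a given set of binary codeword lengths. Note that the code ψ_1 from the example above satisfies the inequality even though it is not uniquely decodable, so the inequality is a necessary condition only.

```python
def kraft_sum(codeword_lengths):
    """Left-hand side of the Kraft inequality for binary codewords:
    the sum of 2**(-length). Every uniquely decodable code gives a sum <= 1."""
    return sum(2.0 ** (-length) for length in codeword_lengths)

# The code psi_1(a) = 0, psi_1(b) = 00 has lengths 1 and 2: the sum is
# 0.75 <= 1, yet the code is not uniquely decodable, so satisfying the
# inequality does not by itself certify unique decodability.
print(kraft_sum([1, 2]))     # 0.75

# A prefix (hence uniquely decodable) code a -> 0, b -> 10, c -> 11:
print(kraft_sum([1, 2, 2]))  # 1.0
```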

In this paper we will use the so-called prefix Kolmogorov complexity, whose precise definition can be found in [9, 13]. Its main properties can be described as follows. There exists a uniquely decodable code κ such that (i) there is an algorithm for decoding (i.e. there is a Turing machine which maps κ(u) to u for every \(u \in A^{*}\)) and (ii) for any uniquely decodable code ψ, whose decoding is algorithmically realizable, there exists a non-negative constant \(C_{\psi}\) such that

$$ \bigl|\kappa(u)\bigr| - \bigl|\psi(u)\bigr| < C_{\psi} $$
(8)

for every \(u \in A^{*}\); see Theorem 3.1.1 in [13]. The prefix Kolmogorov complexity K(u) is defined as the length of κ(u): K(u)=|κ(u)|. The code κ is not unique, but the second property means that the codelengths of any two codes \(\kappa_1\) and \(\kappa_2\) for which (i) and (ii) hold are equal up to a constant: \(\bigl||\kappa_1(u)|-|\kappa_2(u)|\bigr| < C_{1,2}\) for any word u (and the constant \(C_{1,2}\) does not depend on u, see (8)). So, K(u) is defined up to a constant. In what follows we call this value “Kolmogorov complexity”.

We can see from (ii) that the code κ is asymptotically (up to a constant) the best method of data compression, but it turns out that there is no algorithm that can calculate the codeword κ(u) (and even K(u)). That is why the code κ (and Kolmogorov complexity) cannot be used for practical data compression directly.

The following Claim is by Levin [32, Proposition 5.1]:

Claim

For any stationary ergodic source τ

$$ \lim_{t \rightarrow \infty} t^{-1} K(x_1 \ldots x_t) = h_\infty(\tau) $$
(9)

with probability 1.

Comment

In [32] this claim is formulated for “common” Kolmogorov complexity, but it is also valid for the prefix Kolmogorov complexity, because for any word \(x_1 \ldots x_t\) the difference between both complexities is O(log t), see [13].

Let us describe universal codes, or data compressors. For their description we recall that (as is known in Information Theory) sequences \(x_1 \ldots x_t\) generated by a source p can be “compressed” down to the length \(-\log p(x_1 \ldots x_t)\) bits and, on the other hand, there is no code ψ for which the expected codeword length \(\Sigma_{x_1 \ldots x_t \in A^t} p(x_1 \ldots x_t)\, |\psi(x_1 \ldots x_t)|\) is less than \(- \Sigma_{x_1 \ldots x_t \in A^t} p(x_1 \ldots x_t) \log p(x_1 \ldots x_t)\). Universal codes reach the lower bound \(-\log p(x_1 \ldots x_t)\) asymptotically for any stationary ergodic source p with probability 1. The formal definition is as follows: a code φ is universal if for any stationary ergodic source p

$$ \lim_{t \rightarrow \infty} t^{-1} \bigl(- \log p(x_1 \ldots x_t) - \bigl|\varphi(x_1 \ldots x_t)\bigr| \bigr) = 0 $$
(10)

with probability 1. So, informally speaking, universal codes estimate the probability characteristics of the source p and use them for efficient “compression.”
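
In practice, off-the-shelf compressors play the role of such universal codes: the per-letter length of the compressed file is a computable stand-in for the complexity rate in (9). The following sketch (an illustration added here, not the authors' exact procedure) uses the standard Python bz2 and zlib modules merely as examples of a data compressor φ; the paper itself relies on other archivers (e.g., the KGB archiver [11]), and the constant overhead on short sequences should be kept in mind.

```python
import bz2
import random
import zlib

def compressed_bits_per_letter(text, method="bz2"):
    """Per-letter length (in bits) of the compressed text, used as a rough
    estimate of its complexity rate. Real compressors only approximate a
    universal code, and short texts carry a noticeable constant overhead,
    so values are best compared between texts of similar length."""
    data = text.encode("ascii")
    packed = bz2.compress(data) if method == "bz2" else zlib.compress(data, 9)
    return 8 * len(packed) / len(text)

# A rigid, stereotyped sequence compresses much better than an irregular
# sequence over the same alphabet and of the same length.
stereotyped = "abcabcabc" * 100
flexible = "".join(random.choices("abc", k=len(stereotyped)))
print(compressed_bits_per_letter(stereotyped))  # small
print(compressed_bits_per_letter(flexible))     # near log2(3), plus overhead
```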

Appendix 2: Proofs of Theorems

Proof of Theorem 1

For any universal code from the Shannon-McMillan-Breiman theorem (7) and the definition (10) we obtain the following equation

$$ \lim_{t \rightarrow \infty} t^{-1} \bigl|\varphi(x_1 \ldots x_t)\bigr| = h_\infty(p) , $$
(11)

with probability 1. Taking this equality and (9) into account, we obtain the statement of Theorem 1. □

Proof of Theorem 2

For any t the level of significance of the test T equals α, so, by definition, the Type I error of the test \(T^{'}_{\varphi}\) equals α, too. In order to prove the second statement of the theorem we suppose that the hypothesis \(H_1\) is true. From (9) we can see that there exist constants \(k_1\) and \(k_2\) such that with probability 1

$$ \lim_{t \rightarrow \infty} t^{-1} K(x_1 \ldots x_t) = k_i , $$
(12)

where \(x_1 \ldots x_t \in S_i\), i=1,2, and \(k_1 \neq k_2\). Let us suppose that \(k_1 > k_2\) and define \(\Delta = k_1 - k_2\). From (1), (12) and Theorem 1 we can see that

$$ \lim_{t \rightarrow \infty} K_\varphi(x_1 \ldots x_t) = k_i $$
(13)

with probability 1 (here \(x_1 \ldots x_t \in S_i\), i=1,2). By definition, this means that for any ϵ>0 and δ>0 there exists t′ such that

$$P \bigl\{ \bigl| K_\varphi(x_1 \ldots x_t) - k_i \bigr| < \epsilon \bigr\} \ge 1-\delta $$

for \(x_1 \ldots x_t \in S_i\), i=1,2, when t>t′. Hence, if ϵ=Δ/4, then with probability at least 1−δ all values \(K_\varphi(x_1 \ldots x_t)\), \(x_1 \ldots x_t \in S_1\), are greater than all values \(K_\varphi(x_1 \ldots x_t)\), \(x_1 \ldots x_t \in S_2\). So, with probability at least 1−δ the set of ranked values will look like the right part of (4) and, hence, the hypothesis \(H_0\) will be rejected (for large enough \(|S_i|\), i=1,2). Taking into account that \(\min(|S_1|,|S_2|) \rightarrow \infty\) and t→∞, we can see that the last statement is valid for any δ. The theorem is proved. □
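
The tests T and \(T^{'}_{\varphi}\) themselves are defined in the main text (equations (1)-(5), not reproduced in this appendix). Purely as an illustration of the ranking argument above, the following sketch applies a Mann-Whitney-style rank test to per-letter compressed lengths of two groups of ethological “texts”; the particular compressor (bz2), the SciPy rank test, and all function names are assumptions of this sketch, not the authors' exact procedure.

```python
import bz2
from scipy.stats import mannwhitneyu

def k_phi(sequence):
    """Per-letter compressed length (in bits) of a behavioural 'text',
    a computable estimate of its complexity."""
    data = sequence.encode("ascii")
    return 8 * len(bz2.compress(data)) / len(sequence)

def compare_groups(group_1, group_2, alpha=0.05):
    """Compress every sequence in both groups, rank the per-letter
    codelengths, and apply a two-sided Mann-Whitney U test. A small
    p-value is evidence that the two groups differ in complexity
    (e.g., stereotyped versus flexible behaviour)."""
    scores_1 = [k_phi(s) for s in group_1]
    scores_2 = [k_phi(s) for s in group_2]
    statistic, p_value = mannwhitneyu(scores_1, scores_2, alternative="two-sided")
    return p_value, p_value < alpha
```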

Appendix 3: Dictionary of Gull’s Behaviours

Symbol | Gull’s position | Demonstrative postures and actions | Position of wings | Vocalisation
------ | --------------- | ---------------------------------- | ----------------- | ------------
0 | sitting on eggs | upright | folding | aggressive call
1 | sitting on eggs | upright | folding | no
2 | sitting on eggs | oblique | folding | aggressive call
3 | sitting on eggs | oblique | folding | long call
4 | sitting on eggs | oblique | folding | no
5 | sitting on eggs | biting with a beak | folding | no
6 | sitting on eggs | snapping with a beak | folding | no
7 | sitting on eggs | no | folding | aggressive call
7 | sitting on eggs | no | folding | long call
8 | sitting on eggs | no | folding | no
9 | standing on a nest | upright | folding | aggressive call
a | standing on a nest | upright | folding | no
b | standing on a nest | oblique | stretching | aggressive call
c | standing on a nest | oblique | stretching | long call
d | standing on a nest | oblique | stretching | no
e | standing on a nest | oblique | folding | aggressive call
f | standing on a nest | oblique | folding | long call
g | standing on a nest | oblique | folding | no
h | standing on a nest | oblique | flapping | aggressive call
i | standing on a nest | oblique | flapping | long call
j | standing on a nest | oblique | flapping | no
k | standing on a nest | biting with a beak | stretching | no
l | standing on a nest | biting with a beak | folding | no
m | standing on a nest | biting with a beak | flapping | no
n | standing on a nest | snapping with a beak | stretching | no
o | standing on a nest | snapping with a beak | folding | no
p | standing on a nest | snapping with a beak | flapping | no
q | standing on a nest | slashing with a wing | flapping | aggressive call
r | standing on a nest | slashing with a wing | flapping | long call
s | standing on a nest | slashing with a wing | flapping | no
t | standing on a nest | no | stretching | aggressive call
u | standing on a nest | no | stretching | no
v | standing on a nest | no | folding | aggressive call
w | standing on a nest | no | folding | no
x | standing on a nest | no | flapping | aggressive call
y | standing on a nest | no | flapping | no
z | flying | no | flapping | aggressive call
A | flying | no | flapping | no
B | flying | swooping to attack | flapping | long call
C | flying | swooping to attack | flapping | aggressive call
D | flying | swooping to attack | flapping | no
E | flying | biting with a beak | flapping | no
F | flying | snapping with a beak | flapping | no
G | sitting on a perch | oblique | stretching | aggressive call
H | sitting on a perch | oblique | stretching | long call
I | sitting on a perch | oblique | stretching | no
J | sitting on a perch | oblique | folding | aggressive call
K | sitting on a perch | oblique | folding | long call


Cite this article

Ryabko, B., Reznikova, Z., Druzyaka, A. et al. Using Ideas of Kolmogorov Complexity for Studying Biological Texts. Theory Comput Syst 52, 133–147 (2013). https://doi.org/10.1007/s00224-012-9403-6
