Abstract
Kolmogorov complexity furnishes many useful tools for studying natural processes that can be expressed as sequences of symbols from a finite alphabet (texts), such as genetic texts, literary and musical texts, animal communications, etc. Although Kolmogorov complexity is not algorithmically computable, in a certain sense it can be estimated by means of data compressors. Here we suggest a method of analysis of sequences based on ideas of Kolmogorov complexity and mathematical statistics, and apply this method to biological (ethological) “texts.” What distinguishes the suggested method from other Kolmogorov-complexity-based approaches to the analysis of sequential data is that it belongs to the framework of mathematical statistics, more specifically, that of hypothesis testing. This makes it a promising candidate for inclusion in the toolbox of standard biological methods for the analysis of different natural texts, from DNA sequences to animal behavioural patterns (ethological “texts”). Two examples of analysis of ethological texts are considered in this paper. These examples show that the proposed method is a useful tool for distinguishing between stereotyped and flexible behaviours, which is important for behavioural and evolutionary studies.
References
Ané, C., Sanderson, M.J.: Missing the forest for the trees: phylogenetic compression and its implications for inferring complex evolutionary histories. Syst. Biol. 54(1), 146–157 (2005)
Billingsley, P.: Ergodic Theory and Information. Wiley, New York (1965)
Cilibrasi, R., Vitanyi, P.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, New York (2006)
Ferragina, P., Giancarlo, R., Greco, V., Manzini, G., Valiente, G.: Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. BMC Bioinf. 8 (2007)
Fisher, R.A.: Statistical Methods, Experimental Design, and Scientific Inference. Oliver & Boyd, Edinburgh (1956)
Gallager, R.G.: Information Theory and Reliable Communication. Wiley, New York (1968)
Groothuis, T.: The influence of social experience on the development and fixation of the form of displays in the black-headed gull. Anim. Behav. 43(1), 1–14 (1992)
Hutter, M.: Universal Artificial Intelligence. Sequential Decisions Based on Algorithmic Probability. Springer, Berlin (2005)
Kendall, M.G., Stuart, A.: The Advanced Theory of Statistics, Inference and Relationship, vol. 2. Griffin, London (1961)
KGB archiver (v. 1.2). http://www.softpedia.com/get/Compression-tools/KGB-Archiver.shtml
Kieffer, J., Yang, E.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46, 737–754 (2000)
Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, New York (1997)
Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., Zhang, H.Y.: An information based distance and its application to whole mitochondrial genome phylogeny. Bioinformatics (Oxford) 17, 149–154 (2001)
Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. IEEE Trans. Inf. Theory 50(12), 3250–3264 (2004)
Lloyd, E. (ed.): Handbook of Applied Statistics, vol. 2. Wiley-Interscience, New York (1984)
McCowan, B., Doyle, L.R., Hanser, S.F.: Using information theory to assess the diversity, complexity, and development of communicative repertoires. J. Comp. Psychol. 116(2), 166–172 (2002)
Oller, D.K., Griebel, U. (eds.): Evolution of Communicative Flexibility: Complexity, Creativity, and Adaptability in Human and Animal Communication. MIT Press, Cambridge (2008)
Panteleeva, S., Danzanov, Zh., Reznikova, Zh.: Estimate of complexity of behavioral patterns in ants: analysis of hunting behavior in Myrmica rubra (Hymenoptera, Formicidae) as an example. Entomol. Rev. 91(2), 221–230 (2011)
Reznikova, Z.: Animal Intelligence: From Individual to Social Cognition. Cambridge University Press, Cambridge (2007)
Reznikova, Zh., Panteleeva, S.: An ant’s eye view of culture: propagation of new traditions through triggering dormant behavioural patterns. Acta Ethol. 11, 73–80 (2008)
Rissanen, J.: Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory 30(4), 629–636 (1984)
Ryabko, B.: Prediction of random sequences and universal coding. Probl. Inf. Transm. 24(2), 87–96 (1988)
Ryabko, B., Reznikova, Zh.: Using Shannon entropy and Kolmogorov complexity to study the communicative system and cognitive capacities in ants. Complexity 2, 37–42 (1996)
Ryabko, B., Reznikova, Z.: The use of ideas of information theory for studying “language” and intelligence in ants. Entropy 11, 836–853 (2009)
Ryabko, D., Schmidhuber, J.: Using data compressors to construct order tests for homogeneity and component independence. Appl. Math. Lett. 22(7), 1029–1032 (2009)
Ryabko, B., Astola, J., Gammerman, A.: Application of Kolmogorov complexity and universal codes to identity testing and nonparametric testing of serial independence for time series. Theor. Comput. Sci. 359, 440–448 (2006)
Tinbergen, N.: An objective study of the innate behaviour of animals. Bibl. Biotheor. 1, 39–98 (1942)
Tinbergen, N.: The Study of Instinct. Oxford University Press, London (1951)
Vitanyi, P.M.B.: Information distance in multiples. IEEE Trans. Inf. Theory 57(4), 2451–2456 (2011)
Yaglom, A.M., Yaglom, I.M.: Probability and Information. Theory and Decision Library. Springer, Berlin (1983)
Zvonkin, A.K., Levin, L.A.: The complexity of finite objects and concepts of information and randomness through the algorithm theory. Russ. Math. Surv. 25(6), 83–124 (1970)
Additional information
Research was supported by the Russian Foundation for Basic Research (grants 12-07-00125 and 11-04-00536), by the Integrated Project of the Siberian Branch of RAS (grant no. 21), by the Program “Living Nature” of the Presidium of the Russian Academy of Sciences (grant no. 30.6), and by the Program of cooperative investigations of SB RAS and third parties (grant no. 63).
Appendices
Appendix 1: Universal Codes, Shannon Entropy and Kolmogorov Complexity
First we briefly describe stochastic processes (or sources of information). Consider a finite alphabet \(A\), and denote by \(A^{t}\) and \(A^{*}\) the set of all words of length \(t\) over \(A\) and the set of all finite words over \(A\), respectively (\(A^{*} = \bigcup_{i=1}^{\infty}A^{i}\)).

A process \(P\) is called stationary if

\( P(x_{1}=a_{1}, \ldots, x_{k}=a_{k}) = P(x_{t+1}=a_{1}, \ldots, x_{t+k}=a_{k}) \)

for all \(t, k \in \mathbb{N}\) and all \((a_{1}, \ldots, a_{k}) \in A^{k}\). A stationary process is called stationary ergodic if the frequency of occurrence of every word \(a_{1}, \ldots, a_{k}\) converges (a.s.) to \(P(a_{1}, \ldots, a_{k})\). For more details see [2, 4, 7].
Let \(\tau\) be a stationary ergodic source generating letters from a finite alphabet \(A\). The m-order (conditional) Shannon entropy and the limit Shannon entropy are defined as follows:

\( h_{m}(\tau) = - \sum_{v \in A^{m}} P(v) \sum_{a \in A} P(a \mid v) \log P(a \mid v), \qquad h_{\infty}(\tau) = \lim_{m \to \infty} h_{m}(\tau) \)

[2, 7]. The well-known Shannon-McMillan-Breiman theorem states that

\( \lim_{t \to \infty} - \frac{\log P(x_{1} \ldots x_{t})}{t} = h_{\infty}(\tau) \)  (7)

with probability 1; see [2, 4, 7].
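The theorem can be illustrated numerically. The following sketch (not from the paper) uses an i.i.d., hence stationary ergodic, Bernoulli source over a two-letter alphabet; the parameter \(p = 0.3\) and the sample size are arbitrary illustrative choices.

```python
# Numerical illustration of the Shannon-McMillan-Breiman theorem for an
# i.i.d. (hence stationary ergodic) Bernoulli(p) source over A = {a, b}:
# the per-letter value -log P(x_1...x_t) / t approaches the entropy h.
import math
import random

p = 0.3
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # about 0.881 bits

random.seed(0)
t = 100_000
neg_log_prob = 0.0
for _ in range(t):
    x_is_a = random.random() < p
    neg_log_prob -= math.log2(p if x_is_a else 1 - p)

print(round(h, 3))                 # 0.881
print(round(neg_log_prob / t, 3))  # close to h
```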
Now we define codes and Kolmogorov complexity. Let \(A^{\infty}\) be the set of all infinite words \(x_{1}x_{2}\ldots\) over the alphabet \(A\). A data compression method (or code) \(\varphi\) is defined as a set of mappings \(\varphi_{n}: A^{n} \to \{0,1\}^{*}\), \(n = 1, 2, \ldots\), such that \(\varphi_{n}(x) \neq \varphi_{n}(y)\) for each pair of different words \(x, y \in A^{n}\). Informally, this means that the code \(\varphi\) can be applied to compress a message of any length \(n\) over the alphabet \(A\), and the message can be decoded if its code is known. It is also required that each sequence \(\varphi_{n}(u_{1})\varphi_{n}(u_{2})\ldots\varphi_{n}(u_{r})\), \(r \geq 1\), of encoded words from the set \(A^{n}\), \(n \geq 1\), can be uniquely decoded into \(u_{1}u_{2}\ldots u_{r}\). Such codes are called uniquely decodable. For example, let \(A = \{a, b\}\); the code \(\psi_{1}\) with \(\psi_{1}(a) = 0\), \(\psi_{1}(b) = 00\) is obviously not uniquely decodable (the sequence 000 decodes as both \(ab\) and \(ba\)). It is well known that if a code \(\varphi\) is uniquely decodable then the lengths of the codewords satisfy the following (Kraft) inequality: \(\sum_{u \in A^{n}} 2^{- |\varphi_{n}(u)|} \leq 1\); see, e.g., [4, 7].
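The Kraft sum is easy to compute directly. In the sketch below (not from the paper), the code names psi1 and phi are illustrative; psi1 is the non-uniquely-decodable example above.

```python
# Sketch: the Kraft inequality sum over a code's codeword lengths. The codes
# psi1 and phi are illustrative examples, not taken from the paper.

def kraft_sum(lengths):
    """Return the sum of 2^(-l) over the given codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

# psi1(a) = 0, psi1(b) = 00 satisfies the Kraft inequality (0.75 <= 1), yet
# "000" decodes as both ab and ba, so psi1 is NOT uniquely decodable: the
# inequality is necessary for unique decodability but not sufficient.
psi1 = {"a": "0", "b": "00"}
print(kraft_sum(len(w) for w in psi1.values()))  # 0.75

# A prefix-free code is uniquely decodable and also satisfies the inequality.
phi = {"a": "0", "b": "10", "c": "11"}
print(kraft_sum(len(w) for w in phi.values()))  # 1.0
```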
In this paper we use the so-called prefix Kolmogorov complexity, whose precise definition can be found in [9, 13]. Its main properties can be described as follows. There exists a uniquely decodable code \(\kappa\) such that (i) there is an algorithm for decoding (i.e. there is a Turing machine which maps \(\kappa(u)\) to \(u\) for every \(u \in A^{*}\)), and (ii) for any uniquely decodable code \(\psi\) whose decoding is algorithmically realizable, there exists a non-negative constant \(C_{\psi}\) such that

\( |\kappa(u)| \leq |\psi(u)| + C_{\psi} \)

for every \(u \in A^{*}\); see Theorem 3.1.1 in [13]. The prefix Kolmogorov complexity \(K(u)\) is defined as the length of \(\kappa(u)\): \(K(u) = |\kappa(u)|\). The code \(\kappa\) is not unique, but the second property means that the codelengths of any two codes \(\kappa_{1}\) and \(\kappa_{2}\) for which (i) and (ii) hold are equal up to a constant: \(||\kappa_{1}(u)| - |\kappa_{2}(u)|| < C_{1,2}\) for every word \(u\) (and the constant \(C_{1,2}\) does not depend on \(u\), see (8)). So \(K(u)\) is defined up to a constant. In what follows we call this value simply the “Kolmogorov complexity.”
We can see from (ii) that the code \(\kappa\) is asymptotically (up to an additive constant) the best method of data compression, but it turns out that there is no algorithm that can compute the codeword \(\kappa(u)\), or even \(K(u)\). That is why the code \(\kappa\) (and Kolmogorov complexity) cannot be used directly for practical data compression.
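The practical substitute is the output length of a real data compressor. A minimal sketch of this idea follows; zlib is one convenient illustrative choice of compressor, not the one prescribed by the paper, and the per-letter normalisation is likewise illustrative.

```python
# Sketch: K(u) is uncomputable, but a real compressor yields a computable
# estimate in the same spirit: the compressed length, here reported in bits
# per letter. zlib is an illustrative choice of compressor.
import random
import zlib

def compressed_bits_per_letter(text: str) -> float:
    data = text.encode("ascii")
    return 8 * len(zlib.compress(data, 9)) / len(data)

regular = "ab" * 5000  # highly regular text: small complexity estimate
random.seed(0)
irregular = "".join(random.choice("ab") for _ in range(10_000))

# The regular sequence compresses far better than the irregular one.
print(compressed_bits_per_letter(regular) < compressed_bits_per_letter(irregular))  # True
```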
The following Claim is by Levin [32, Proposition 5.1]:
Claim
For any stationary ergodic source \(\tau\),

\( \lim_{t \to \infty} \frac{K(x_{1} \ldots x_{t})}{t} = h_{\infty}(\tau) \)

with probability 1.
Comment
In [32] this claim is formulated for the plain (“common”) Kolmogorov complexity, but it is also valid for the prefix Kolmogorov complexity, because for any word \(x_{1} \ldots x_{t}\) the difference between the two complexities is \(O(\log t)\); see [13].
Let us describe universal codes, or data compressors. Recall that, as is known in information theory, a sequence \(x_{1} \ldots x_{t}\) generated by a source \(p\) can be “compressed” down to the length \(- \log p(x_{1} \ldots x_{t})\) bits and, on the other hand, there is no code \(\psi\) for which the expected codeword length \(\sum_{x_{1} \ldots x_{t} \in A^{t}} p(x_{1} \ldots x_{t}) |\psi(x_{1} \ldots x_{t})|\) is less than \(- \sum_{x_{1} \ldots x_{t} \in A^{t}} p(x_{1} \ldots x_{t}) \log p(x_{1} \ldots x_{t})\). Universal codes reach the lower bound \(- \log p(x_{1} \ldots x_{t})\) asymptotically for any stationary ergodic source \(p\) with probability 1. The formal definition is as follows: a code \(\varphi\) is universal if for any stationary ergodic source \(p\)

\( \lim_{t \to \infty} \frac{1}{t} \bigl( |\varphi(x_{1} \ldots x_{t})| + \log p(x_{1} \ldots x_{t}) \bigr) = 0 \)

with probability 1. So, informally speaking, universal codes estimate the probability characteristics of the source \(p\) and use them for efficient “compression.”
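This behaviour can be sketched numerically. In the example below (not from the paper), zlib stands in for a universal code; it is not a universal code in the strict sense, so its rate only approaches the entropy bound rather than attaining it, and the source parameter \(p = 0.1\) is an arbitrary illustrative choice.

```python
# Sketch: a compressor's per-letter codelength on long output of a
# Bernoulli(p) source should approach the entropy h. zlib is used as an
# illustrative stand-in for a universal code.
import math
import random
import zlib

p = 0.1
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # about 0.469 bits

random.seed(1)
seq = "".join("a" if random.random() < p else "b" for _ in range(200_000))
rate = 8 * len(zlib.compress(seq.encode("ascii"), 9)) / len(seq)

print(round(h, 3))  # 0.469
print(rate)         # somewhat above h, far below the 8 bits of raw ASCII
```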
Appendix 2: Proofs of Theorems
Proof of Theorem 1
For any universal code \(\varphi\), the Shannon-McMillan-Breiman theorem (7) and the definition (10) yield

\( \lim_{t \to \infty} \frac{|\varphi(x_{1} \ldots x_{t})|}{t} = h_{\infty}(\tau) \)

with probability 1. Taking this equality and (9) into account, we obtain the statement of Theorem 1. □
Proof of Theorem 2
For any t the level of significance of the test T equals α; so, by definition, the Type I error of the test \(T'_{\varphi}\) equals α, too. In order to prove the second statement of the theorem, suppose that the hypothesis \(H_{1}\) is true. From (8) we can see that there exist constants \(k_{1}\) and \(k_{2}\) such that with probability 1

\( \lim_{t \to \infty} \frac{K(x_{1} \ldots x_{t})}{t} = k_{i}, \)

where \(x_{1} \ldots x_{t} \in S_{i}\), \(i = 1, 2\), and \(k_{1} \neq k_{2}\). Without loss of generality suppose that \(k_{1} < k_{2}\) and define \(\Delta = k_{2} - k_{1}\). From (1), (12) and Theorem 1 we can see that

\( \lim_{t \to \infty} K_{\varphi}(x_{1} \ldots x_{t}) = k_{i} \)

with probability 1 (here \(x_{1} \ldots x_{t} \in S_{i}\), \(i = 1, 2\)). By definition, this means that for any \(\epsilon > 0\) and \(\delta > 0\) there exists \(t'\) such that

\( P\bigl( |K_{\varphi}(x_{1} \ldots x_{t}) - k_{i}| < \epsilon \bigr) > 1 - \delta \)

for \(x_{1} \ldots x_{t} \in S_{i}\), \(i = 1, 2\), when \(t > t'\). Hence, if \(\epsilon = \Delta/4\), then with probability at least \(1 - \delta\) all values \(K_{\varphi}(x_{1} \ldots x_{t})\), \(x_{1} \ldots x_{t} \in S_{1}\), are less than all values \(K_{\varphi}(x_{1} \ldots x_{t})\), \(x_{1} \ldots x_{t} \in S_{2}\). So, with probability at least \(1 - \delta\), the set of ranked values will look like the right part of (4) and, hence, the hypothesis \(H_{0}\) will be rejected (for large enough \(|S_{i}|\), \(i = 1, 2\)). Taking into account that \(\min(|S_{1}|, |S_{2}|) \to \infty\) and \(t \to \infty\), we can see that the last statement is valid for any \(\delta\). The theorem is proved. □
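The separation argument of the proof can be illustrated numerically. The sketch below is not the exact test \(T'_{\varphi}\) of the main text; it assumes (for illustration only) that \(K_{\varphi}\) is estimated by per-letter compressed length, with zlib as the compressor, and shows how sequences from a low-complexity (stereotyped) group and a high-complexity (flexible) group separate completely when ranked.

```python
# Illustration of the separation argument (not the paper's exact test
# T'_phi): estimate K_phi by per-letter compressed length in bits, rank all
# values from both groups, and observe that one group precedes the other.
import random
import zlib

def k_phi(text: str) -> float:
    """Per-letter compressed length in bits (illustrative K_phi estimate)."""
    data = text.encode("ascii")
    return 8 * len(zlib.compress(data, 9)) / len(data)

random.seed(2)
S1 = ["ab" * (900 + 30 * i) for i in range(5)]  # stereotyped: low complexity
S2 = ["".join(random.choice("ab") for _ in range(1800)) for _ in range(5)]

ranked = sorted((k_phi(x), group) for group, S in (("S1", S1), ("S2", S2)) for x in S)
print([group for _, group in ranked])  # all S1 values precede all S2 values
```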
Appendix 3: Dictionary of Gull’s Behaviours
Symbol | Gull’s position | Demonstrative postures and actions | Position of wings | Vocalisation |
---|---|---|---|---|
0 | sitting on eggs | upright | folding | aggressive call |
1 | sitting on eggs | upright | folding | no |
2 | sitting on eggs | oblique | folding | aggressive call |
3 | sitting on eggs | oblique | folding | long call |
4 | sitting on eggs | oblique | folding | no |
5 | sitting on eggs | biting with a beak | folding | no |
6 | sitting on eggs | snapping with a beak | folding | no |
7 | sitting on eggs | no | folding | aggressive call |
7 | sitting on eggs | no | folding | long call |
8 | sitting on eggs | no | folding | no |
9 | standing on a nest | upright | folding | aggressive call |
a | standing on a nest | upright | folding | no |
b | standing on a nest | oblique | stretching | aggressive call |
c | standing on a nest | oblique | stretching | long call |
d | standing on a nest | oblique | stretching | no |
e | standing on a nest | oblique | folding | aggressive call |
f | standing on a nest | oblique | folding | long call |
g | standing on a nest | oblique | folding | no |
h | standing on a nest | oblique | flapping | aggressive call |
i | standing on a nest | oblique | flapping | long call |
j | standing on a nest | oblique | flapping | no |
k | standing on a nest | biting with a beak | stretching | no |
l | standing on a nest | biting with a beak | folding | no |
m | standing on a nest | biting with a beak | flapping | no |
n | standing on a nest | snapping with a beak | stretching | no |
o | standing on a nest | snapping with a beak | folding | no |
p | standing on a nest | snapping with a beak | flapping | no |
q | standing on a nest | slashing with a wing | flapping | aggressive call |
r | standing on a nest | slashing with a wing | flapping | long call |
s | standing on a nest | slashing with a wing | flapping | no |
t | standing on a nest | no | stretching | aggressive call |
u | standing on a nest | no | stretching | no |
v | standing on a nest | no | folding | aggressive call |
w | standing on a nest | no | folding | no |
x | standing on a nest | no | flapping | aggressive call |
y | standing on a nest | no | flapping | no |
z | flying | no | flapping | aggressive call |
A | flying | no | flapping | no |
B | flying | swooping to attack | flapping | long call |
C | flying | swooping to attack | flapping | aggressive call |
D | flying | swooping to attack | flapping | no |
E | flying | biting with a beak | flapping | no |
F | flying | snapping with a beak | flapping | no |
G | sitting on a perch | oblique | stretching | aggressive call |
H | sitting on a perch | oblique | stretching | long call |
I | sitting on a perch | oblique | stretching | no |
J | sitting on a perch | oblique | folding | aggressive call |
K | sitting on a perch | oblique | folding | long call |
Cite this article
Ryabko, B., Reznikova, Z., Druzyaka, A. et al. Using Ideas of Kolmogorov Complexity for Studying Biological Texts. Theory Comput Syst 52, 133–147 (2013). https://doi.org/10.1007/s00224-012-9403-6