Error-correcting codes and neural networks

Encoding, transmission and decoding of information are ubiquitous in biology and human history: from DNA transcription to spoken/written languages and the languages of the sciences. During the last decades, the study of neural networks in the brain performing their multiple tasks has provided more and more detailed pictures of (fragments of) this activity. Mathematical models of this multifaceted process have led to some fascinating problems about "good codes" in mathematics, engineering, and now biology as well. The notion of "good" or "optimal" codes depends on technological progress and on the criteria defining optimality of codes of various types: error-correcting ones, cryptographic ones, noise-resistant ones, etc. In this note, I discuss recent suggestions that the activity of some neural networks in the brain, in particular those responsible for space navigation, can be well approximated by the assumption that these networks produce and use good error-correcting codes. I give mathematical arguments supporting the conjecture that the search for optimal codes is built into neural activity and is observable.


Introduction and summary
Recently it became technically possible to simultaneously record the spiking activity of large neural populations (cf. Refs. 1, 2 in [21]). The data supplied by these studies show "signatures of criticality": the respective neural populations function near the point of a phase transition if one chooses appropriate statistical models of their behaviour.
In some papers the relevant data for the retina were interpreted as arguments for the optimality of the encoding of visual information (cf. [22]), whereas in other works, such as [21], it was argued that criticality might be a general property of many models of collective neural behaviour, unrelated to encoding optimality.
In this note I test the philosophy relating criticality to optimality using the results of recent works suggesting models of encoding of stimulus space that utilise the basic notions of the theory of error-correcting codes. The recent collaboration of neurobiologists and mathematicians has led, in particular, to the consideration of binary codes used by the brain for encoding and storing a stimulus domain, such as a rodent's territory, through the combinatorics of its covering by local neighbourhoods: see [1,2,24].
These binary codes as they are described in [1,2,24] (cf. also a brief survey for mathematicians [14]) are not good error-correcting codes themselves. However, as combinatorial objects they are produced by other neural networks during the stage when a rodent, say, studies a new environment. These other networks play an essential role during this formation period, and I suggest that the auxiliary codes they use in this process are error-correcting ones, changing in the course of seeking optimality.
In laboratory experiments, the dynamics of these auxiliary codes is described in terms of criticality, i.e. of working near phase transition boundaries.
I approach the problem of relating criticality to the optimality of such neural activities using a new statistical model of error-correcting codes explained in [15]. In this model the thermodynamic energy is measured by the complexity of the information to be encoded; this complexity is in turn interpreted as the length of a maximally compressed form of this information, Kolmogorov style. As far as I know, the respective Boltzmann partition function based upon L. Levin's probability distributions ([9,10]) had never appeared before [15] in the theory of codes.
In this setting, the close relationship between criticality and optimality becomes an established (although highly non-obvious) mathematical fact, and I argue that the results of [1,2,24] allow one to transpose it into the domain of neurobiology.
In the main body of the paper below, Sect. 1 is dedicated principally to the mathematics of error-correcting codes, in particular, of good ones, whereas Sect. 2 focusses on mathematical models of neurological data and related problems from our perspective.
I am very grateful to many people whose kind attention and expert opinions helped crystallize the views expressed in this article, especially S. Dehaene, and recently W. Levelt who, in particular, directed me to the recent survey [18].
I am also happy to dedicate this paper to the inspired researcher Sasha Beilinson, who is endowed with almost uncanny empathy for living inhabitants of this planet!

1 Codes and good codes

Codes
In this article a code C means a set of words of finite length in a finite alphabet A, i.e. a subset C ⊂ ⋃_{n≥1} A^n. Here we will be considering only finite codes, or even codes consisting of words of a fixed length.
Informally, codes are used for the representation ("encoding") of certain information as a sequence of code words. If a code is imagined as a dictionary, then information is encoded by texts, and all texts admissible in some sense (for example, "grammatically correct" ones) constitute a language. The main motivation for encoding information is the communicative function of a language: a text can be transmitted from a source to a receiver/target via a channel.
An informal notion of a "good" or even "optimal" code strongly depends on the circumstances in which the encoding/transmission/decoding takes place. For example, cryptography provides codes helping to avoid unauthorised access and/or falsification of information, whereas the construction of error-correcting codes allows the receiver to reconstruct, with high probability, the initial text sent by the source through a noisy channel.
Of course, it is unsurprising that, after the groundbreaking work of Shannon-Weaver, the quality of error-correcting codes is estimated in probabilistic terms. One starts by postulating certain statistical properties of the transmitted information and of the noisy channel, and produces a class of codes maximising the probability that the signal obtained at the receiver end is close to the one that was sent. In the best case, at the receiver end one should be able to correct all errors using an appropriate decoding program.
In this paper, I will be interested in a large class of error-correcting codes whose quality is from the start assessed in purely combinatorial terms, without appealing to probabilistic notions at all. The fact that good codes lie near/on the so-called asymptotic bound, whose existence was also established by combinatorial means (see [23] and references therein), was only recently interpreted in probabilistic terms as criticality of such codes: cf. [12,15,16].
The class of probability distributions that popped up in this research was introduced in [9]. It involves a version of Kolmogorov complexity, has a series of quite nonobvious properties (e.g. a fractal symmetry), and was subsequently used in [13] in order to explain Zipf's law and its universality.
In the remaining part of this section, I will briefly explain these results.

Combinatorics of error-correcting codes
From now on, we consider finite codes consisting of words of a fixed length: C ⊂ A^n, n ≥ 1. The cardinality of the alphabet A is denoted q ≥ 2. If A, C are not endowed with any additional structure (in the sense of Bourbaki), we refer to C as an unstructured code. If such an additional structure is chosen, e.g. A is a linear space over a finite field F_q and C is a linear subspace of it, we may refer to such a code as a linear one, or generally by the name of this structure.
Consider, for example, neural codes involved in place field recognition according to [2,24]. Here q = 2, and in [24] these codes are identified with subsets of F_2^n, but they are not assumed to be linear subspaces.
Besides the alphabet cardinality q = q(C) and the word length n = n(C), the two most important combinatorial characteristics of a code C are its (logarithmic) cardinality k(C) := log_q card(C) and the minimal Hamming distance between different code words:

d(C) := min { d(w_1, w_2) | w_1, w_2 ∈ C, w_1 ≠ w_2 },

where d(w_1, w_2) is the number of positions in which w_1 and w_2 differ. In the degenerate case card C = 1 we put d(C) = 0. We will call the numbers q, k = k(C), n = n(C), d = d(C) the code parameters, and refer to C as an [n, k, d]_q-code.
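As a concrete illustration, the parameters [n, k, d]_q of a small code can be computed directly from these definitions; the following sketch (purely illustrative, with the binary even-weight code of length 3 as an example) does so.

```python
from itertools import combinations
from math import log

def hamming(w1, w2):
    """Hamming distance: number of positions where two equal-length words differ."""
    return sum(a != b for a, b in zip(w1, w2))

def code_parameters(C, q):
    """Return (n, k, d): word length, logarithmic cardinality log_q card(C),
    and minimal pairwise Hamming distance (0 in the degenerate case card C = 1)."""
    C = list(C)
    n = len(C[0])
    k = log(len(C), q)
    d = min((hamming(u, v) for u, v in combinations(C, 2)), default=0)
    return n, k, d

# The binary even-weight code of length 3 is a [3, 2, 2]_2-code.
C = ["000", "011", "101", "110"]
print(code_parameters(C, 2))  # -> (3, 2.0, 2)
```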
These parameters play the following role in crude estimates of code quality. The arity q(C) and the number q^{k(C)} of pairwise different code words essentially determine how many "elementary units" of source information can be encoded by one-word messages. Codes with very large q (such as Chinese characters in comparison with alphabetic systems) can encode thousands of semantic units by one-letter words, whereas alphabetic codes with q ≈ 30 in principle allow one to encode 30^4 units by words of length 4.
The minimal distance between two code words gives an upper estimate of the number of letters in a code word misrepresented by a noisy channel that can still be safely corrected at the receiving end. Thus, if the channel is not very noisy, a small d(C) might suffice; otherwise larger values of d(C) are needed. A large d(C), however, slows down information transmission, because the source is not allowed to use all q^n words of length n for encoding, but must still transmit a sequence of n-letter words.
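The standard quantitative form of this estimate is that nearest-codeword decoding corrects up to [(d−1)/2] flipped letters. A minimal illustrative sketch with the binary repetition code of length 5 (d = 5, so any 2 flips are corrected):

```python
def nearest_codeword(C, received):
    """Nearest-codeword decoding: return the code word at minimal
    Hamming distance from the received word."""
    return min(C, key=lambda w: sum(a != b for a, b in zip(w, received)))

# The length-5 binary repetition code has d = 5, hence corrects
# up to floor((5 - 1) / 2) = 2 misrepresented letters.
C = ["00000", "11111"]
print(nearest_codeword(C, "10100"))  # -> "00000": two flips corrected
```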
Finally, when solving engineering problems, it might be preferable to use structured codes, e.g. linear ones, in order to minimise encoding and decoding efforts at the source/target ends. But in this paper, I will not deal with it.
In the remaining part of the paper, in particular, in discussions of optimality, the alphabet cardinality q is fixed.

From combinatorics to statistics: I. The set of code points and asymptotic bound
Consider an [n, k, d]_q-code C as above and introduce two rational numbers: the (relative) transmission rate and the relative minimal distance

R(C) := [k(C)]/n(C),  δ(C) := d(C)/n(C).   (1.1)

In (1.1) we used the integer part [k(C)] rather than k(C) itself in order to obtain a rational number, as was suggested in [12]. The larger n is, the closer (1.1) is to the version of R(C) used in [15,16] and earlier works. According to our discussion above, an error-correcting code is good if, in a sense, it simultaneously maximises the transmission rate and the minimal distance.
A considerable bulk of research in this domain is dedicated either to the construction/engineering of (families of) "good" error-correcting codes or to the proofs that "too good" codes do not exist. Since a choice of the transmission rate in a given situation is dictated by the statistics of noise in a noisy channel, we may imagine this task as maximisation of δ(C) for each fixed R(C).
In order to treat this problem as mathematicians (rather than engineers) do, we introduce the notion of a code point cp(C) := (R(C), δ(C)) ∈ [0, 1]^2 ∩ Q^2, denote by P_q the set of all codes of a given arity q, and by V_q the set of code points of all codes in P_q.
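For instance, the code point of (1.1) can be computed directly; the sketch below is illustrative, with the generator matrix being one standard choice for the binary [7, 4, 3] Hamming code, whose code point is (4/7, 3/7).

```python
from itertools import combinations, product
from math import floor, log

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def code_point(C, q):
    """cp(C) = ([k(C)]/n, d(C)/n) with [.] the integer part, as in (1.1)."""
    n = len(next(iter(C)))
    # a small epsilon guards against floating-point log landing just below an integer
    k = floor(log(len(C), q) + 1e-9)
    d = min(hamming(u, v) for u, v in combinations(C, 2))
    return k / n, d / n

# Generate the binary [7, 4, 3] Hamming code as the row span over F_2
# of a standard generator matrix (all 16 linear combinations of its rows).
G = [(1,0,0,0,0,1,1), (0,1,0,0,1,0,1), (0,0,1,0,1,1,0), (0,0,0,1,1,1,1)]
C = {tuple(sum(c*g for c, g in zip(coeffs, col)) % 2 for col in zip(*G))
     for coeffs in product((0, 1), repeat=4)}
print(code_point(C, 2))  # cp(C) = (4/7, 3/7)
```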
Let U_q be the closed set of limit points of V_q. We will call elements of V_q ∩ U_q limit code points. The remaining subset, of isolated code points, is V_q \ (V_q ∩ U_q).
More than thirty years ago it was proved that U_q consists of all points in [0, 1]^2 lying below the graph of a certain continuous decreasing function α_q:

U_q = {(R, δ) ∈ [0, 1]^2 | R ≤ α_q(δ)}.

Moreover, α_q(0) = 1, α_q(δ) = 0 for 1 − q^{−1} ≤ δ ≤ 1, and the graph of α_q is tangent to the R-axis at (1, 0) and to the δ-axis at (0, 1 − q^{−1}). This curve is called the asymptotic bound. For a modern treatment and a vast amount of known estimates for asymptotic bounds, cf. [23].
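The exact form of α_q is not known, but classical combinatorial estimates of the kind surveyed in [23] bound it from below. A minimal sketch evaluating the well-known Gilbert-Varshamov lower estimate α_q(δ) ≥ 1 − H_q(δ), valid for 0 ≤ δ ≤ 1 − q^{−1}:

```python
from math import log

def H_q(delta, q):
    """q-ary entropy function H_q(delta)."""
    if delta == 0:
        return 0.0
    return (delta * log(q - 1, q) - delta * log(delta, q)
            - (1 - delta) * log(1 - delta, q))

def gv_bound(delta, q):
    """Gilbert-Varshamov lower estimate for alpha_q(delta),
    valid for 0 <= delta <= 1 - 1/q."""
    return 1 - H_q(delta, q)

# Sample values for q = 2; the true asymptotic bound lies on or above these,
# and the two endpoint values match alpha_2(0) = 1, alpha_2(1/2) = 0.
for delta in (0.0, 0.1, 0.25, 0.5):
    print(f"alpha_2({delta}) >= {gv_bound(delta, 2):.4f}")
```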
Thus, an error-correcting code can be considered a good one, if its point either lies in U q and is close to the asymptotic bound, or is isolated, that is, lies above the asymptotic bound.
The main result of [12] was the following description of limit and isolated code points in terms of the computable map cp : P q → V q rather than topology of the unit square.
We will say that a code point x ∈ V_q has infinite (resp. finite) multiplicity if cp^{−1}(x) ⊂ P_q is infinite (resp. finite). According to [12], code points of infinite multiplicity are limit points; therefore isolated code points have finite multiplicity.
The existence of isolated codes is established, but we are pretty far from understanding the whole set of them. According to our criteria, an isolated code having a transmission rate matching the channel noise level would really be the best of all good codes. But it is totally unclear how such codes could be engineered. By contrast, trial and error might lead us close to a crossing of an appropriate asymptotic bound, and we will later stick to this subclass of "optimal codes".

From combinatorics to statistics: II. Crossing asymptotic bound as a phase transition
The main result of [15] (see also [16]) was the suggestion to use on the set P_q the probability distribution in which the energy level of a code is a version of its Kolmogorov complexity; this distribution was introduced and studied by L. Levin. Furthermore, we proved that the respective Boltzmann partition function using this distribution produces a phase transition curve exactly matching the asymptotic bound α_q. In order to make the analogy with phase transitions in physical systems more precise, it is convenient to introduce the function inverse to α_q, which we denote β_q. The equation of the asymptotic bound is then written in the form δ = β_q(R), and the domain below α_q is defined by the inequality δ < β_q(R).
It was argued in [16], sec. 3, that with our partition function, the transmission rate is a version of density, whereas the curve δ = β q (R) becomes an analog of the (temperature, density)-phase transition curve.

Criticality and optimality for error-correcting codes
Imagine now that the source can try various codes whose transmission rate lies in a small neighbourhood of a chosen value and, after feedback from the target, decide whether the chosen code is better than the previously tested ones. Then the dynamical statistical characteristics of these trial-and-error attempts below the asymptotic bound will show signatures of criticality: cf. [7,8,22].

2 Neural encodings of stimulus spaces

2.1 Spatial maps in the brain I: orientation and navigation in a familiar territory
In this subsection, I briefly survey information about the position and functioning of those neural networks in the brain that encode the local position and navigation of an animal in the world. For many more details and references, see [1,6] and references therein.
It is postulated that a given type of stimuli (a topographical one, or, say, a semantic one as in [17,20]) can be modelled via a topological, or metric, stimulus space X. Furthermore, the brain's reaction to a neighbourhood of a point in X is modelled by the spiking activity of a finite set of neurons N_X.
In turn, measurable data of such activity are simplified in order to obtain a bijective correspondence between elements of a certain finite covering of X by subsets U := (U_i | i = 1, . . . , n) and certain subsets of N_X, the "cell groups". Namely, if the animal is in U_i, the neurons in the respective cell group "fire significantly above baseline within a broad (about 250 ms) temporal window" ([1], p. 2). Encoding firing neurons by 1 and the remaining ones by 0, we thus construct a binary code C ⊂ A^n whose words correspond to local positions of the animal, whereas the various paths of the moving animal are translated into sequences of code words.
Moreover, the combinatorics of the code C reflects the combinatorics of the mutual intersections of the local neighbourhoods (U_i). Namely,

w ∈ C  ⟺  ( ⋂_{i ∈ supp(w)} U_i ) \ ( ⋃_{j ∉ supp(w)} U_j ) ≠ ∅,   (2.1)

where for a binary word w, supp(w) consists of all positions where the respective letter of w is 1. See [24] or a brief account in [14], sec. 2.3, for further information on the topological content of such encoding.
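This construction can be sketched computationally: given a finite covering of a toy stimulus space, collect the firing patterns of its points. The one-dimensional "territory" and its three overlapping place fields below are hypothetical, chosen only for illustration.

```python
def cover_code(points, cover):
    """Binary code of a covering: for each point x of the stimulus space,
    the code word has a 1 in position i iff x lies in U_i
    (i.e. the i-th cell group fires)."""
    return {tuple(int(x in U) for U in cover) for x in points}

# Toy 1-D "territory" X = {0, ..., 9} covered by three overlapping
# place fields (a hypothetical covering, for illustration only).
X = range(10)
U = [set(range(0, 5)), set(range(3, 8)), set(range(6, 10))]
print(sorted(cover_code(X, U)))
# The words record which place fields overlap: e.g. (1, 1, 0) appears
# because U_1 and U_2 intersect on {3, 4}.
```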

The spatial maps in the brain II: study of a new territory and/or transition between two different familiar territories
In this subsection, I will briefly discuss the dynamics of neurons and neural nets responsible for place recognition, and their topography in the brain. Here I rely heavily upon [6].
(a) Neurons in N_X above are called place cells. They are located in the hippocampus.
Moreover, in the study of new territories and in the passage between territories a very important role is played by additional classes of cortical cells: (b) Grid cells in parahippocampal entorhinal brain regions encode a coordinate grid, functionally similar to the grid of meridians and parallels on maps of the Earth's surface. However, the grid pattern is hexagonal rather than tetragonal.
Head direction cells encode the direction of the animal's head, and border cells fire in a neighbourhood of the border of the territory.
(c) When an animal moves from one local part of the environment to another local part which is already encoded by a certain group of place cells, the respective encoding is extracted from memory where it is stored.
However, if an animal enters an unexplored, or recently changed, local part of the environment, a very specific type of neural activity must accommodate the exploration and encoding of the geometry of this local part. In particular, signals from the retina must find their way to the place cells and be transformed into the respective code. This code is not unique: it depends on the choice of a covering of this local part, which is formed dynamically during the exploration process.

Where might good error-correcting codes be needed?
The first tentative answer to this question that comes to mind is that codes (2.1) themselves must be optimal ones: it would be reasonable to expect this as a condition for efficient space navigation.
However, it is easy to see that the combinatorics of the covering can be such that the minimal Hamming distance between two code words is small, for example just 1. Namely, call a code C ⊂ {0, 1}^n simplicial if for each w ∈ C and each v ∈ A^n with supp v ⊂ supp w we have v ∈ C. Such codes naturally appear in (2.1), and they are often quite bad by the standard criteria.
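A minimal sketch of this observation: the simplicial closure of any code containing a nonzero word automatically contains two words at Hamming distance 1, so d(C) = 1.

```python
from itertools import combinations

def simplicial_closure(C):
    """Add to C every word v with supp(v) contained in supp(w) for some w in C."""
    out = set()
    for w in C:
        support = [i for i, b in enumerate(w) if b]
        for r in range(len(support) + 1):
            for sub in combinations(support, r):
                out.add(tuple(int(i in sub) for i in range(len(w))))
    return out

def min_distance(C):
    return min(sum(a != b for a, b in zip(u, v))
               for u, v in combinations(C, 2))

# A simplicial code containing a nonzero word w also contains w with one
# letter of its support removed, hence its minimal distance is 1.
C = simplicial_closure({(1, 1, 0, 1, 0)})
print(min_distance(C))  # -> 1
```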
In this note, I argue that good error-correcting codes must be looked for at a higher level ("upstream"): these codes presumably are formed when a brain learns to produce new place-recognising neural codes, and to project local maps into the more extended entorhinal maps. Thus, the search for good error-correcting codes is an intermediate process, essential in the transfer of information from, say, the retina to hippocampal or entorhinal cells.
Moreover, I argue that the formation of such a code can be modelled by the source/channel/target scheme, and that the optimization of the code, as suggested by [GaRiVe03], starts with the stage of source compression. The main new insight is the suggestion that an appropriate mathematical model of this optimization is the choice of a Kolmogorov-optimal short program capable of generating the possibly much longer source data.
Then, according to the philosophy explained in Sects. 1.4-1.5 above, the dynamics of this choice might generate criticality signatures related to the phase transition near the asymptotic bound.

Pro and contra arguments
Arguments against the conjecture that (an approximation to) Kolmogorov complexity might be used by the brain as a criterion of good source compression mostly concern the unusual properties of complexity and of the relevant Levin probability distribution.
Below I will briefly state two such arguments and collect the respective counter arguments.
A. The exact mathematical definition of Kolmogorov complexity can be applied only to a potentially infinite set of objects that can be effectively encoded, say, by all finite words in a given finite alphabet, or else by texts in a given finite alphabet. Complexity then produces encodings of comparable length for most of these data, but much shorter ones for an infinite, yet infinitely rare, minority of them.
In fact, the level of compression approximating Kolmogorov's is already clearly visible in experiments/databases related to initial fragments of the potentially infinite list whose length does not exceed, say, 10^4 or 10^6: cf. [4,5] and a detailed discussion in [13].
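Although Kolmogorov complexity itself is uncomputable, practical compressors give computable upper proxies for it at exactly these data sizes. A minimal sketch (zlib is used here only as a stand-in for whatever compressor one prefers, not as the method of [4,5]):

```python
import random
import zlib

def compressed_length(s: bytes) -> int:
    """Length of the zlib-compressed form of s: a computable upper
    proxy for Kolmogorov complexity (which is itself uncomputable)."""
    return len(zlib.compress(s, 9))

# A highly regular string of 10^4 letters admits a short program-like
# description; a pseudo-random string of the same length barely compresses.
regular = b"ab" * 5000
random.seed(0)
noise = bytes(random.randrange(256) for _ in range(10000))

print(compressed_length(regular))   # a few dozen bytes
print(compressed_length(noise))     # close to 10000 bytes
```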
It is less clear how large a volume of data needs to be encoded for efficient space navigation. In [6], where this picture based on experiments on rodents is surveyed, the question is asked whether human brains represent space navigation, or other information, using the same principles as rodents' brains. If this is true, then certainly human minds have to encode, store, and recall volumes of information sufficient to see the effects of Kolmogorov efficiency.
At least some research suggests that the answer must be positive. In particular, the paper [6] itself refers to it in its Concluding Remarks. The localisation of orientation/navigation neural networks in the human hippocampus was also confirmed by the study [11] of licensed London taxi drivers. In the pre-GPS era, taxi drivers had to pass an intense two-year training in orientation in London. It turned out that the total mass of hippocampal place cells of the licensed drivers was considerably larger than in the control group. Assuming that the same principles of encoding are used in human brains, one can see that already at this volume of data compressed encoding might be used efficiently.
For this author, an even more convincing argument is the universality of Zipf's law and the suggestion in [13] that it is explained by L. Levin's probability distribution, in which the probability of the use of an encoded object is a version of the inverse exponential Kolmogorov complexity.
More generally, the amply discussed idea of "semantic spaces" (cf. [17] and references therein) suggests that the encoding and use of linguistic data in human brains have strong parallels with those in space navigation.

B. The model of criticality used, say, in [19] and earlier works on neural activity differs from the one used here. In particular, the divergence of the specific heat as a function of temperature is often invoked.
My main argument defending asymptotic bound for error-correcting codes as the appropriate phase transition curve is that it supplies a strong intrinsic mathematical background for the study of the statistics of large volumes of information used and stored in brain.
Criticality as the behaviour of a neural system seeking good encoding is also intuitively very natural if one takes into account that, although inverse exponential complexity is not computable, it is "computable from above" in a precise mathematical sense.
Finally, the inverse exponential Kolmogorov complexity has very strong fractal and self-similarity properties. For all this, see [15,16]. By contrast, the intuitively natural code-counting statistics, using e.g. Shannon's Random Code Ensemble, decidedly overlook good codes lying near the asymptotic bound: cf. [16], subsections 1.3-1.6.