Introduction

The term “predictive coding” is nowadays often used to refer to a family of models of perceptual inference in the hierarchically organized visual cortex. It is a diverse family, with various Bayesian and artificial neural network models, some of which can process images while others cannot. It is also a family with a divide regarding the roles of feedforward and feedback connections. One set of models assumes that feedback connections carry predictions while feedforward connections carry prediction errors (e.g., Rao & Ballard, 1999). The other set assumes that feedforward connections carry predictions while feedback connections carry constraints on these predictions (e.g., Lee & Mumford, 2003). In this article, I contrast a recent Bayesian model from the former set with a long-standing representational model close to the latter set.

More specifically, focusing on theoretical aspects, I contrast Friston’s (2009, 2010) Bayesian version of predictive coding with the representational approach called structural coding (see van der Helm, 2014). Like other predictive coding models, these two models aim at unifying competence (i.e., what is a system’s output?) and performance (i.e., how does the system arrive at this output?). What is special is that both models use free-energy minimization as a metaphor for processing in the brain, but with totally different elaborations of this metaphor. One difference is that, in free-energy (FE) predictive coding, predictions are based on probabilities, whereas in structural coding, they are based on descriptive complexities (Footnote 1). This is, at bottom, merely a difference in means, but in these two coding approaches, it led to fundamentally different views on hierarchical perceptual inference. The core ideas of the two coding approaches may be introduced briefly as follows.

FE predictive coding, on the one hand, draws on von Helmholtz’s (1909/1962) idea, also known as the likelihood principle, that we perceive the most likely objects or events that would fit the sensory input that we are trying to interpret (cf. Hochberg, 1978; Gregory, 1973; Pomerantz & Kubovy, 1986). Strong versions take “most likely” to refer to objective probabilities in the world (which does not seem tenable; Feldman, 2013; van der Helm, 2000, 2011), but Bayesians usually take it to refer to subjective probabilities, or beliefs. In any case, FE predictive coding assumes that the visual system tests predictions in a top-down fashion—along recurrent (or feedback, or reentrant, or descending) neural connections—against the sensory input. Prediction errors are returned in a bottom-up fashion—along feedforward (or ascending) connections—to update the to-be-recycled predictions (Bastos et al., 2012). This process is driven by prediction-error reduction, which is seen as reflecting free-energy minimization (Friston, 2009, 2010) and which is formulated in terms of Shannon’s (1948) classical information theory.

Structural coding, on the other hand, draws on the Gestalt law of Prägnanz (Koffka, 1935; Köhler, 1920, 1929; Wertheimer, 1912, 1923). This law was inspired by the idea that the brain, like any physical system, tends to settle in relatively stable states defined by a minimum of free energy. It is generally understood to refer to a tendency towards regularity, symmetry, and simplicity, or as Koffka (1935) formulated it for vision: “Of several geometrically possible organizations that one will actually occur which possesses the best, the most stable shape” (p. 138). Building on this Gestaltist idea and on seminal work by, for instance, MacKay (1950), Hochberg and McAlister (1953), Attneave (1954), and Garner (1962), structural coding began as a competence model (Leeuwenberg, 1968), but nowadays, it also includes performance (van der Helm, 2012, 2014, 2015a).

Structural coding assumes that the perceptual process in the visual hierarchy in the brain comprises three neurally intertwined subprocesses, namely, feedforward extraction of visual features from sensory input, horizontal (or lateral) binding of similar features, and recurrent selection of different features to be integrated into percepts (cf. Lamme, Supèr, & Spekreijse, 1998). These three subprocesses, together, are assumed to yield simplest hierarchical organizations of sensory input—that is, organizations in terms of wholes and parts, which are formally definable by a minimum number of descriptive parameters. This idea—which reflects Occam’s razor—is also known as the simplicity principle. It is in line with modern information theory, which may need some further introduction.

Modern information theory arose in reaction to Shannon’s (1948) classical information theory. Whereas classical information theory requires knowledge of actual probabilities to optimize things, modern information theory aims to do more or less the same without needing to know actual probabilities. From the 1960s onward, it developed into algorithmic information theory (AIT) in mathematics (see Li & Vitányi, 1997) and, independently, into structural information theory (SIT) in human visual perception research (see Leeuwenberg & van der Helm, 2013). There are differences between AIT and SIT but, relevant here, they share the Occamian idea that the simplest interpretation of data is the best one (Footnote 2; for discussions on these issues, see van der Helm, 2000, 2011, 2014).

Hence, the idea that visual perception is a form of unconscious inference governed by free-energy minimization is a long-standing Gestaltist idea adopted first by structural coding and only later by FE predictive coding (which, to my knowledge, has been silent about these historical roots). However, the two coding approaches differ fundamentally regarding the questions of (a) how this idea is cast in information-theoretic terms, and (b) what the underlying neural mechanisms entail. The former question is a competence question, which I address first—also because it provides a leg up to the latter question, which is a performance question. To focus on the broader conceptual issues, I skip technical details of both coding approaches (these can be found elsewhere). For the same reason, I elaborate neither on the wealth of neurophysiological data that is claimed to support FE predictive coding (see Clark, 2013), nor on the wealth of behavioral data that is claimed to support structural coding (see Leeuwenberg & van der Helm, 2013).

Competence

As said, whereas predictions in FE predictive coding are based on probabilities, predictions in structural coding are based on descriptive complexities. However, a descriptive complexity C can be converted into the artificial probability $p_a = 2^{-C}$, which is called an algorithmic probability in AIT (Li & Vitányi, 1997) and a precisal in SIT (van der Helm, 2000). This conversion assigns higher probabilities to simpler things, and it implies that structural coding, too, can be given a Bayesian formulation (see Fig. 1). Before discussing this further, it is expedient to contrast this modern information-theoretic notion of precisal with the classical information-theoretic notion of surprisal (term by Tribus, 1961), which plays a role in FE predictive coding (where it is called “surprise”).
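By way of illustration, the conversion is a one-liner; the following minimal Python sketch assumes complexities expressed in bits, and the function name is merely illustrative:

```python
def precisal(complexity: float) -> float:
    """Convert a descriptive complexity C (in bits) into the
    artificial probability p_a = 2**(-C)."""
    return 2.0 ** (-complexity)

# Simpler descriptions get higher probabilities:
print(precisal(3))  # 0.125
print(precisal(5))  # 0.03125
```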

Fig. 1. Objective or subjective probabilities p can be used to maximize Bayesian certainty and, via the surprisal conversion from classical information theory (classical IT), also to minimize information as quantified in classical IT. Descriptive complexities C can be used to minimize information as quantified in modern information theory (modern IT) and, via the precisal conversion from modern IT, also to maximize Bayesian certainty under these probabilities.

The precisal, on the one hand, is a probability derived from a descriptive complexity, that is, from an information quantification based on a description of an individual message (e.g., a hypothesis), whose hierarchical internal structure reflects that of the message. The surprisal, on the other hand, is Shannon’s (1948) solution to get an optimal encoding of messages, that is, to minimize the long-term average burden on communication channels given the transmission probabilities of pre-chosen messages. The surprisal of a message is the negative logarithm of its transmission probability relative to those of all other possible messages, and optimal encoding is achieved by labeling all messages with arbitrary nominalistic codes whose lengths equal their surprisals. Thus, more likely messages are assigned shorter labels (as, e.g., in Morse code). There is some debate in mathematics about whether precisals form a proper probability distribution, but notice that the surprisal is definitely not a descriptive complexity: It is an information quantification based on a message’s probability and is unrelated to the message’s internal structure. Hence, as van der Helm (2000, 2011) argued earlier, it is factually incorrect to claim that the two information quantifications are formally equivalent, as has been claimed, implicitly or explicitly, by Chater (1996), Friston (2010), and Thornton (2014), for instance.
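The contrast can be made concrete in a small sketch (the example probabilities are hypothetical; note that the surprisal takes only a probability as input and never inspects a message’s internal structure):

```python
import math

def surprisal(p: float) -> float:
    """Shannon surprisal, in bits, of a message with transmission
    probability p: -log2(p). It depends only on p, not on the
    message's internal structure."""
    return -math.log2(p)

# Optimal encoding labels each message with a code whose length
# approximates its surprisal, so likelier messages get shorter labels:
for message, p in [("common message", 0.5), ("rare message", 0.01)]:
    print(message, round(surprisal(p), 2), "bits")
```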

Bayesian modeling

Bayes’ rule (Bayes & Price, 1763) is a powerful mathematical modeling tool given by:

$$p(H|D) = \frac{p(H) \cdot p(D|H)}{p(D)} $$

In words, Bayes’ rule holds that, for data D to be explained, the posterior probability p(H|D) of hypothesis H is proportional to the prior probability p(H) of H, multiplied by the conditional probability p(D|H) of D if H were true. The probability p(D) of D is the normalization factor. In general, Bayesian approaches aim to establish a posterior probability distribution over the hypotheses, but a specific goal is to select the most likely hypothesis, that is, the one with the highest posterior probability under the employed prior and conditional probabilities. To formulate this specific goal, the normalization factor p(D) can be omitted, yielding:

$$\text{Select the}\ H\ \text{that maximizes}\quad p(H|D) \propto p(H) \cdot p(D|H) $$
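In code, this specific goal amounts to a one-line maximization. The sketch below is generic; `prior` and `likelihood` are hypothetical placeholders, and the toy numbers serve illustration only:

```python
def select_map_hypothesis(hypotheses, prior, likelihood, data):
    """Return the hypothesis H that maximizes p(H) * p(D|H);
    the normalization p(D) is constant over H and can be dropped."""
    return max(hypotheses, key=lambda H: prior(H) * likelihood(data, H))

# Toy usage with made-up numbers:
hypotheses = ["one object", "two objects"]
prior = lambda H: {"one object": 0.6, "two objects": 0.4}[H]
likelihood = lambda D, H: {"one object": 0.3, "two objects": 0.9}[H]
print(select_map_hypothesis(hypotheses, prior, likelihood, data="scene"))
```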

In perceptual organization, Bayes’ rule can be applied to determine the posterior probability p(H|D) of a candidate interpretation H of sensory data D. Such an interpretation, or scene model, comprises a hypothesized organization of the distal stimulus, that is, it comprises hypothesized distal objects that could fit the sensory data. The prior p(H) then is the probability of interpretation H independently of sensory data D, that is, it can be said to indicate how good hypothesis H is in itself (it is therefore also said to account for view-independent properties of H). Furthermore, the conditional p(D|H) then is the probability of sensory data D if interpretation H were true, that is, it can be said to indicate how well data D fit hypothesis H (it is therefore also said to account for view-dependent properties of H).

In FE predictive coding, hypotheses are assumed to be given beforehand, and prediction errors are defined by conditional surprisals, that is, by the negative logarithm of p(D|H). So, in classical information-theoretic terms, it aims to minimize the surprisal of data D given hypothesis H. In structural coding, conversely, hypotheses are assumed to be constructed on the fly from the sensory data, and in modern information-theoretic terms (Footnote 3), it aims to minimize the sum of (a) the prior complexity of hypothesis H and (b) the conditional complexity of data D given hypothesis H (Fig. 2 gives a gist). In Bayesian terms, FE predictive coding aims to maximize conditional probabilities (Footnote 4), while structural coding aims to maximize the product of prior and conditional precisals.
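Schematically, and with hypothetical function names of my own (this is not either model’s actual machinery), the two minimization targets differ as follows:

```python
import math

def fe_score(H, D, p_conditional):
    """FE predictive coding (classical IT): minimize the conditional
    surprisal -log2 p(D|H); the prior of H plays no role here."""
    return -math.log2(p_conditional(D, H))

def structural_score(H, D, prior_complexity, conditional_complexity):
    """Structural coding (modern IT): minimize the prior complexity of H
    plus the conditional complexity of D given H, both counted in
    descriptive parameters."""
    return prior_complexity(H) + conditional_complexity(D, H)
```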

Fig. 2. Each of these configurations can be interpreted as consisting of a long segment and a short segment. The prior complexity of this two-objects hypothesis reflects the effort to construct these segments, and the conditional complexities reflect the effort to bring the segments into each of the given positions. For details of the quantification of conditional complexities, see van Lier et al. (1994); roughly, it corresponds to the intuitively assessed number of positional degrees of freedom to be removed to arrive at a given position. This implies that it increases gradually from (a) to (d). In (c) and (d), the relatively high conditional complexities imply that the one-object hypothesis is predicted to prevail, while in (b), the two-objects hypothesis is predicted to prevail (confirmed by Feldman, 2007). The latter agrees with common ideas that such T-junctions between the contours of two shapes are cues that one shape occludes the other.

Furthermore, as said, in Bayesian models, probabilities usually are beliefs, that is, probabilities based on an individual’s past experience, or knowledge (Footnote 5), while in structural coding, they are precisals, that is, probabilities derived from descriptive complexities. Notice that precisals can be said to reflect a belief but that not every belief is reflected by precisals. This may seem obvious, but as I discuss next, it nevertheless deserves clarification.

No automatic inclusion of Occam’s razor

Bayes’ rule is a selection method, whereas Occam’s razor, or the simplicity principle, is a selection criterion—just as is the Helmholtzian likelihood principle. Bayesian models can accommodate any selection criterion, including Occam’s razor. However, there is a persistent misconception that every Bayesian model agrees automatically with Occam’s razor. This misconception seems to have arisen in the early 1990s, when it also received its first refutation, by Wolpert (1995). It reappeared in Chater’s (1996) claim, reiterated by Feldman (2009), that the simplicity and likelihood principles are equivalent, which was refuted by van der Helm (2000, 2011). Nevertheless, invoking Chater (1996) and Feldman (2009), Thornton (2014) persisted in this claim—an argument that is crucially flawed in that it ignores the fundamentally different ways in which classical and modern information theory quantify information (see the beginning of section “Competence”). Furthermore, just like Feldman (2009), Thornton (2014) invoked an argument by MacKay (2003), which van der Helm (2011) had refuted as follows.

MacKay argued that a category of more complex instances spreads probability mass over more instances than a category of simpler instances does, so that such simpler instances tend to get higher probabilities. Notice that this presupposes (a) a correlation between complexity and category size, and (b) that every category gets an equal probability mass. These presuppositions are inherent neither to Bayes’ rule nor to the Helmholtzian likelihood paradigm. In fact, they are at the heart of the following insightful reasoning about the reliability of simplicity as a predictor.

Imagine a world with objects generated by, each time, first randomly selecting a complexity category and then randomly selecting an instance from that category. Thus, in the first step, all categories have the same probability of being selected, and in the second step, all instances in the selected category have the same probability of being selected. By definition, instances in a category of complexity C are describable by C parameters, so the category size is proportional to $2^C$. This implies that the probability that a particular instance is selected is proportional to $2^{-C}$—which, notably, is the earlier-mentioned precisal $p_a$. Hence, in this particular kind of world—which MacKay seemed to have in mind—the simplicity and likelihood principles are equivalent, but notice that this says nothing about how these principles are related in other imagined or actual worlds.
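This two-step world is easy to simulate. A toy sketch, assuming category sizes of exactly $2^C$ instances:

```python
import random
from collections import Counter

categories = [1, 2, 3, 4]  # complexity C of each category
counts = Counter()
N = 400_000
for _ in range(N):
    C = random.choice(categories)             # step 1: uniform over categories
    instance = (C, random.randrange(2 ** C))  # step 2: uniform within category
    counts[instance] += 1

# Each instance in category C is selected with probability
# (1/4) * 2**(-C), i.e., proportional to the precisal 2**(-C):
for C in categories:
    print(C, round(counts[(C, 0)] / N, 4), "expected", 0.25 * 2 ** -C)
```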

In other words, MacKay’s argument is not an argument that Bayesians can use to claim automatic inclusion of Occam’s razor, but it is one that Occamians might use to promote Occam’s razor as a belief worthy of building Bayesian models on. In AIT, this belief has been supported by showing, among other things, that simplest descriptive codes yield near-optimal encoding if the actual probability distribution is one from the infinite set of enumerable probability distributions (Li & Vitányi, 1997). Thus, simplest descriptive codes can be said to have a general-purpose nature in that they yield fairly optimal encoding in many imaginable worlds (Footnote 6). As I discuss in a moment, something similar holds for the veridicality of simplest descriptive codes.

Hence, Bayesian models can comply with Occam’s razor, but they do not comply automatically with it. To comply with Occam’s razor, one would have to start from precisals or, if one prefers to use objective probabilities, one would have to assume a world like the one MacKay (2003) apparently had in mind. As far as I can tell, this holds for all types of Bayesian models. That is, it holds for both parametric Bayesian models (in which predictions depend on chosen belief parameters, as, e.g., in FE predictive coding) and nonparametric Bayesian models (where “nonparametric” means that belief parameters are adjusted on the fly as the incoming data are gathered; for such models of cognition, see, e.g., Austerweil & Griffiths, 2013). It also holds for hierarchical Bayesian inference models, which I discuss in the section on performance, because they are intimately related to ideas about neural implementation. By way of prelude to this, but still pertaining to competence, I next discuss plain Bayesian inference.

Bayesian inference and the role of action in perception

Bayesian inference is basically the recursive application of Bayes’ rule. In perceptual organization, this general technique is particularly convenient to model visual updating by moving observers, as van der Helm (2000) explicated as follows (see also Fig. 3).

Fig. 3. Everyday perception by moving observers. (a) You take a first glance at a scene. (b) You probably interpret it as a black shape occluding this grey shape. (c) You move, and what you see then may trigger a visual update leading to a revision of your first interpretation.

A moving observer usually gets a growing sample D of different views (i.e., proximal stimuli) of the same distal scene. Suppose sample D consists, at first, of only one view, with $H_i$ (i = 1, 2, ...) as candidate interpretations and with prior and conditional probabilities $p(H_i)$ and $p(D|H_i)$, so that the posterior probabilities $p(H_i|D)$ can be determined by applying Bayes’ rule. Then, each time an additional view enters the sample D, the previously computed posterior probabilities $p(H_i|D)$ can be taken as the new prior probabilities $p(H_i)$, which, together with the conditional probabilities $p(D|H_i)$ for the expanded sample D, can be used to determine new posterior probabilities by again applying Bayes’ rule. This recursive application of Bayes’ rule is not guaranteed to always converge on one interpretation (cf. Diaconis & Freedman, 1986), but generally, it converges on one interpretation, which, under the employed conditionals, will continue to get the highest posterior when sample D is expanded further (cf. Li & Vitányi, 1997).
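A minimal sketch of this recursion, with hypothetical hypotheses and made-up conditional probabilities; the point is only that each posterior serves as the next prior:

```python
def bayes_update(priors, conditional, view):
    """One recursion of Bayes' rule: return posteriors p(H_i|D),
    which become the priors for the next view."""
    unnormalized = {H: p * conditional(view, H) for H, p in priors.items()}
    total = sum(unnormalized.values())
    return {H: q / total for H, q in unnormalized.items()}

# Hypothetical example: the conditionals favor "two objects" in every view.
priors = {"one object": 0.9, "two objects": 0.1}  # arbitrary first priors
conditional = lambda view, H: {"one object": 0.2, "two objects": 0.8}[H]
for view in ["view 1", "view 2", "view 3", "view 4"]:
    priors = bayes_update(priors, conditional, view)
    print(view, {H: round(p, 3) for H, p in priors.items()})
# The effect of the first priors fades; the conditionals decide.
```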

Hence, if one has (approximately) the right conditional probabilities, then several (not too atypical) views of a distal scene suffice to make a (fairly) reliable inference about what the distal scene comprises and, thereby, what subsequent views will show. That is, the trick of the recursive application of Bayes’ rule is that, after several recursions, the effect of the first priors fades away because the priors are updated continuously on the basis of the conditionals, which, thereby, become the decisive entities. This useful trick brings me to the next two observations on the role of action in perception.

First, AIT found that the margin between precisals and probabilities from an enumerable probability distribution P is at most equal to the complexity of P (Li & Vitányi, 1997)—this complexity corresponds roughly to the number of categories to which P assigns probabilities. This holds for priors and conditionals, and it again illustrates the general-purpose nature of simplest descriptive codes, which, by this finding, can be said to be fairly veridical in many imaginable worlds. For perception, this can be sharpened as follows. The number of prior categories in the world is very high, so that prior precisals are probably not very veridical. However, the number of conditional categories for a specific hypothesis is relatively small, that is, there usually are few qualitatively different views of a scene—this suggests that conditional precisals are pretty veridical. For instance, if one throws two sticks on the floor, then the result might be one of the four configurations in Fig. 2—with, notably, the same probability for all four if they are taken exactly as they are depicted. If taken as representatives of classes of similar configurations, however, their (subjective) probabilities are indeed inversely related to the conditional complexities of these individual configurations. For Bayesian inference, this implies that one could just as well use precisals instead of actual probabilities, because the decisive conditionals yield about the same predictive power in both cases (van der Helm, 2000).

Second, FE predictive coding seems to give action priority over perception, or as Friston (2009) put it: “perception is an inevitable consequence of active exchange with the environment” (p. 293) and “perception is enslaved by action to provide veridical predictions” (p. 295). However, as shown above, the role of action in everyday perception (or “active inference”, as Friston calls it)—though certainly relevant—is rather simple and straightforward. The foregoing also shows that the inclusion of action into the equation is not helpful in assessing what the first priors in perceptual organization might be. After all, as long as one has approximately the right conditionals, Bayesian inference works quite well for a moving observer—no matter which first priors are used. Yet, the question of the first priors is definitely relevant in human perception research, which, for instance, also aims to understand the perception of static images (which, in this multimedia era, probably are more abundantly present than in the past). Whereas FE predictive coding is silent about what the first priors might be (see also Footnote 4 and section “Empirical priors”), structural coding gives a principled answer by taking the precisal of a hypothesis as its first prior.

Discussion

As Hoffman (1996) put it in Bayesian terms: We have direct access to only the posteriors of perception. Hence, to understand these posteriors, we have to trace back what the priors and conditionals might have been. Bayes’ rule captures the interplay between priors and conditionals but does, of itself, not supply any specification of priors and conditionals. Therefore, standard Bayesian modeling involves model fitting to tune the parameters of a selection model such that it yields desired outcomes (this stands apart from hypothesis selection, i.e., the subsequent application of such a selection model to find hypotheses that meet the employed selection criterion). This powerful modeling method may well reflect learning strategies at higher cognitive levels, but in my view, perception plays a special role, which is to be distinguished from that of higher cognitive faculties.

Perception is sort of a communication channel, or interface, between the world and higher cognitive faculties. Following Leonardo da Vinci’s (1452–1519) motto “All knowledge has its origins in perception”, structural coding therefore takes perception as a fairly autonomous, data-driven, source of knowledge instead of taking knowledge as a resource for perception (cf. Firestone & Scholl, in press; Gottschaldt, 1926; Hochberg, 1978; Kanizsa, 1985; Pylyshyn, 1999; Rock, 1985). Furthermore, as said, the structural coding model aims to minimize the number of parameters needed to describe hypotheses (this is a form of hypothesis selection), but the structural coding model itself is basically parameter-free (so, no tuning of the selection model to get desired outcomes). In other words, by its simplicity principle, it gives a principled account of priors and conditionals, which, as indicated, provides fairly optimal encoding of data and fairly veridical perception in daily life.

In perceptual organization, the Bayesian distinction between view-independent priors and view-dependent conditionals (be they precisals or other probabilities) concurs with the distinction between the ventral and dorsal streams in the brain, which seem to be dedicated to object perception and spatial perception, respectively (Ungerleider & Mishkin, 1982). The Bayesian integration of priors and conditionals can thus be said to model the interaction between these streams, which leads the visual system from percepts of objects as such to percepts of objects arranged in space. To structural coding, this is the (obviously grey) area where perception tends to end and higher cognitive faculties get the opportunity to enrich its output via a gradually more conscious inference on the basis of internally available contextual information (say, knowledge).

For instance, disks with shadings at the left-hand or right-hand side give fairly ambiguous impressions of concavity and convexity (see Fig. 4a), whereas disks with shadings at the top or bottom give fairly clear impressions of concavity and convexity, respectively (see Fig. 4b). By structural coding, all such disks are perceptually ambiguous. Yet, in some cases, such ambiguities might be resolved at higher cognitive levels (Rock, 1985)—here, for instance, by the knowledge that light usually comes from above. Some Bayesians incorporate such knowledge in models of perception, but I do not think this is needed for the main task of perception, which is to organize incoming (meaningless) pieces of visual information into (meaningful) wholes and parts arranged in space.
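In Bayesian terms, such a knowledge-level resolution can be rendered with hypothetical numbers as follows (the equal conditionals encode the perceptual ambiguity; the prior encodes the light-from-above knowledge):

```python
# Disk shaded at the top; perceptually, concave and convex fit equally well.
conditionals = {"concave": 0.5, "convex": 0.5}  # ambiguous at the perceptual level
prior = {"concave": 0.8, "convex": 0.2}         # "light usually comes from above"
posterior = {H: prior[H] * conditionals[H] for H in conditionals}
total = sum(posterior.values())
print({H: p / total for H, p in posterior.items()})  # knowledge tips the balance
```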

Fig. 4. Shape from shading. (a) Shading at the left-hand or right-hand side is fairly ambiguous regarding the concavity or convexity of the disks. (b) Shading at the top or bottom is fairly clear regarding concavity and convexity, respectively. (After Ramachandran, 1988)

This main task means that a percept reflects a hierarchical organization of a scene. In structural coding, candidate percepts (i.e., hypotheses) are assumed to be constructed from the sensory data and are represented by hierarchical codes, which impose such hierarchical organizations on the data. This contrasts with Bayesian approaches (including FE predictive coding), which are strong in capturing the interplay between probabilities of given hypotheses but which usually are silent about how these hypotheses are structured and represented (be it formally or in the brain).

In sum, regarding competence, FE predictive coding admittedly uses a powerful modeling technique, but in my view, structural coding has more explanatory power because of its principled account of priors and conditionals in terms of fairly stable descriptive complexities. It is also true, however, that FE predictive coding’s main claims pertain not so much to competence but rather to performance. This is discussed next.

Performance

Traditional ideas about the human visual perceptual organization process have taken it to be nothing but a unidirectional, feedforward process from sensory inputs to percepts. This holds neither for FE predictive coding nor for structural coding: Both coding approaches rely on recurrent and horizontal processing too. However, they put forward different forms of message passing. To compare them, I take Lee and Mumford’s (2003) description of hierarchical Bayesian inference in the visual cortex as a reference.

Hierarchical Bayesian inference

Lee and Mumford (2003) proposed a Bayesian predictive coding model that is not based on minimization of prediction errors. Instead, it takes visual area V1—which, via the lateral geniculate nucleus, receives input from the retina—as the first area to construct what they called particles, that is, preliminary interpretations of input parts. These particles are assumed to stay alive during a hierarchical inference process, by which a higher visual area takes particles from the previous area to construct its own larger particles, whose strength then is fed back to the previous area to allow for particle updating—and so on, until the system as a whole reaches an equilibrium. This process is called particle filtering, and during this process, particle updating is assumed to be guided by Bayesian belief propagation. The latter means that the feedback from higher areas provides what they called contextual priors to shape the inference at lower areas.
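A loose schematic of one such update cycle (not Lee and Mumford’s actual algorithm, and with hypothetical values): particle weights at a lower area combine bottom-up support with top-down contextual priors and are renormalized, cycle after cycle, until they stabilize:

```python
def particle_update(particles, bottom_up, contextual_prior):
    """One belief-propagation-style update: weigh each particle by
    feedforward support times feedback from the area above, then
    renormalize."""
    w = {p: bottom_up[p] * contextual_prior[p] for p in particles}
    total = sum(w.values())
    return {p: v / total for p, v in w.items()}

# Hypothetical values for two competing preliminary interpretations:
particles = ["contour continues", "contour ends"]
bottom_up = {"contour continues": 0.4, "contour ends": 0.6}
contextual_prior = {"contour continues": 0.9, "contour ends": 0.1}  # feedback
print(particle_update(particles, bottom_up, contextual_prior))
```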

Lee and Mumford allowed knowledge from higher cognitive levels (say, from beyond perception) to provide such feedback too, but notice that they essentially proposed a data-driven perceptual inference process, by which partial percepts (i.e., particles) interact and compete to arrive eventually at a complete percept. They were not specific about the internal representational structure of particles, but they did suggest that particles might be represented by temporarily synchronized neural assemblies.

Neuronal synchronization is the phenomenon that neurons, in transient assemblies, temporarily synchronize their firing activity. This is a special case of parallel distributed processing (PDP). That is, standard PDP typically involves interacting agents who simultaneously do different things, whereas synchronization involves interacting agents who simultaneously do the same thing—think of flash mobs or choirs going from cacophony to harmony. Both theoretically and empirically, neuronal synchronization has been associated with various cognitive processes, and 30–70 Hz gamma-band synchronization, in particular, has been associated with feature binding in visual perceptual organization (Eckhorn et al., 1988; Gray & Singer, 1989; Milner, 1974; von der Malsburg, 1981).

As I discuss next, FE predictive coding (which includes effects of knowledge) proposes hierarchical Bayesian inference too, but not in the form described by Lee and Mumford. As I discuss subsequently, structural coding (which excludes effects of knowledge) proposes a particle-filtering mechanism, but then with particle updating guided by propagation of the Occamian simplicity belief and with a computationally powerful specification of the representational role of neuronal synchronization in the gamma band.

FE predictive coding’s cognitive architecture

Whereas Lee and Mumford’s (2003) predictive coding approach holds that “the feedforward input drives the generation of the hypotheses” (p. 1436), FE predictive coding argues for more or less the reverse. It explicitly dismisses particle filtering (Friston, 2008, 2009) and relies instead on top-down testing of hypotheses against the sensory input (see Fig. 5). This top-down testing goes, in a hierarchical fashion, through the successive levels in the cortex—each level taken to be responsible for specific (intermediate) aspects. At each level, higher-level predictions are compared with lower-level sensory information to form a prediction error, which is returned to the higher level to enable it to update its predictions—these updated predictions then are recycled to reduce prediction errors at lower levels. In other words, feedforward connections convey information on prediction errors, while feedback connections convey information on predictions from higher cortical areas to suppress prediction errors in lower areas (Bastos et al., 2012).
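The following is a minimal, schematic sketch of this message passing in the spirit of Rao and Ballard (1999), not Friston’s full free-energy scheme; weights and sizes are arbitrary. Predictions descend, prediction errors ascend, and each higher level nudges its state to reduce the errors it receives and sends:

```python
import numpy as np

rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]  # hypothetical generative weights
x = [np.array([1.0, 0.0, 1.0, 0.0]),                    # level 0: clamped sensory input
     rng.normal(size=3), rng.normal(size=2)]            # levels 1-2: inferred states

def errors(x):
    # bottom-up prediction errors: lower state minus top-down prediction
    return [x[l] - W[l] @ x[l + 1] for l in range(2)]

lr = 0.05
print("initial error:", round(sum(float(e @ e) for e in errors(x)), 3))
for _ in range(300):
    e = errors(x)
    x[1] += lr * (W[0].T @ e[0] - e[1])  # combine error from below and own error
    x[2] += lr * (W[1].T @ e[1])         # top level: only error from below
print("final error:", round(sum(float(e @ e) for e in errors(x)), 3))
```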

Fig. 5. FE predictive coding’s view on processing in the brain’s visual hierarchy. Predictions are tested top-down against the sensory input, and prediction errors are returned. At each level in the hierarchy, prediction errors are updated by combining messages from the same level and the level above, and predictions by combining messages from the same level and the level below.

Hence, FE predictive coding basically proposes a sort of glorified template matching. It is true that template matching can be effective in the automatic recognition of things from a limited number of predefined categories, such as print characters or objects on an assembly line. However, in human vision research, it was abandoned long ago because it is too rigid and limited to deal with ill-defined categories and novel objects. To be frank, I do not see how FE predictive coding’s glorified version might turn the tables.

Empirical priors

As said, in FE predictive coding, feedback connections convey information on predictions from higher areas to suppress prediction errors in lower areas. This feedback is said to constitute empirical priors, which are claimed to dissolve the criticism of Bayesian models that they ignore the question of how prior beliefs—necessary for inference—are formed (Friston, 2010). However, notice that these empirical priors depend on the sensory data, that is, they actually are posteriors—which is not altered by the fact that they, just as in plain Bayesian inference (see section “Bayesian inference and the role of action in perception”), are fed back to become the new priors for the next inference cycle. In any case, they do not dissolve the just-mentioned criticism of Bayesian models, which is about first priors, that is, about priors that are independent of the sensory data (see also Trappenberg & Hollensen, 2013). First priors are relevant, simply because they form the starting point of the inference process.

To be clear, the foregoing does not question feedback mechanisms as such. Feedback mechanisms are inherent to hierarchical inference models. For instance, as discussed in section “Hierarchical Bayesian inference”, Lee and Mumford’s (2003) predictive coding model involves feedback of what they called contextual priors, that is, strengths of higher-level particles that had been composed of lower-level ones. Furthermore, structural coding did not invent a special name for the feedback information, but it incorporates basically the same feedback mechanism as that in Lee and Mumford’s model—except that it expresses particle strength in descriptive complexities instead of probabilities (see section “Perception”). In other words, I understand the relevance of the empirical priors in the FE predictive coding scheme, but I think that FE predictive coding gives them more credit than they deserve.

Attention and gamma synchronization

According to Friston (2009), attention simply is the process of optimizing the relative precision of feedforward and feedback information during the hierarchical inference process. Later, Friston (2010) and Bastos et al. (2012) faintly suggested that neuronal synchronization in the gamma band has something to do with prediction errors, and after that, Clark (2013) and Kanai et al. (2015) suggested that gamma synchronization controls the precision associated with prediction errors at lower levels relative to that at higher levels. Notice that, unlike what Friston (2009) attributed to attention, the latter applies to feedforward information only. Be that as it may, my present point is that there is no direct evidence—neurophysiological or otherwise—that attention or gamma synchronization controls the precision associated with prediction errors.

In other words, FE predictive coding’s account rather seems to be a matter of reading into the facts, that is, of attempting to connect the favored approach to accepted phenomena—then, gamma synchronization might indeed be positionable only as being associated somehow with prediction errors. To be clear, I do not object to such attempts, but in this case, I think it is not convincing without, for instance, complementary formal support for the proposed computational role of gamma synchronization.

Structural coding’s cognitive architecture

Compared to FE predictive coding, structural coding assumes that different messages are passed up and down in the brain’s visual hierarchy. Furthermore, structural coding admittedly contains speculative components too, but it does supply complementary formal support for the computational role it attributes to gamma synchronization.

Structural coding’s view on processing in the visual hierarchy includes both perception and (task-driven, top-down) attention. As discussed in van der Helm (2012, 2015a), it conceives of the perceptual organization process as comprising three neurally intertwined but functionally distinguishable subprocesses (see Fig. 6, left-hand panel; cf. Lamme & Roelfsema, 2000; Lamme et al., 1998). These subprocesses are responsible for (a) feedforward extraction of, or tuning to, features to which the visual system is sensitive, (b) horizontal binding of similar features, and (c) recurrent selection of different features. These subprocesses together yield integrated percepts given by hierarchical organizations (i.e., organizations in terms of wholes and parts) of hypothesized distal stimuli that fit the sensory data (see Fig. 6, right-hand panel). Attentional processes then may scrutinize these organizations in a top-down fashion, that is, starting with global structures and, if required by task and allowed by time, descending to local features (Ahissar & Hochstein, 2004; Collard & Povel, 1982; Hochstein & Ahissar, 2002; Wolfe, 2007). This may be specified further as follows for attention and perception, respectively.

Fig. 6. Structural coding’s view on processing in the brain’s visual hierarchy. A stimulus-driven perceptual organization process, comprising three neurally intertwined subprocesses (left-hand panel), yields hierarchical stimulus organizations (right-hand panel). A task-driven attention process then may scrutinize these hierarchical organizations in a top-down fashion.

Attention

Structural coding assumes that, guided by descriptive simplicity, the unconscious perception process arrives at complete percepts (i.e., perceived wholes) via nonlinear interactions between competing partial percepts (I return to this in a moment). It assumes further a top-down attentional scrutiny of the resulting hierarchical organizations, which implies that wholes are consciously experienced before parts. This explains the dominance of wholes over parts, as postulated in early twentieth-century Gestalt psychology (Koffka, 1935; Köhler, 1920, 1929; Wertheimer, 1912, 1923) and as confirmed later in a range of behavioral studies (for a review, see Wagemans et al., 2012). This dominance means, for instance, that humans tend to classify or categorize things on the basis of their global structures (i.e., on the basis of wholes, ignoring minor differences in parts). Based on empirical data, it has been specified further by notions such as global precedence (Navon, 1977), configural superiority (Pomerantz, Sager, & Stoever, 1977), primacy of holistic properties (Kimchi, 2003), and superstructure dominance (Leeuwenberg & van der Helm, 1991; Leeuwenberg, van der Helm, & van Lier, 1994).

To give an example of the dominance of wholes over parts, I consider Fig. 7. It shows a stimulus that is typically perceived as consisting of two triangular parts. These triangular parts therefore are said to be compatible with the perceived global structure, and they are more easily discerned than incompatible parts like the diamond in Fig. 7, bottom right. In terms of Fig. 6, this can be understood as follows (see also van der Helm, 2015b). The perceptual organization process yields perceived hierarchical organizations in terms of global structures and their constituent local features. This means that it preserves the representations of the compatible constituents and masks (or suppresses, or eliminates, or inhibits) those of incompatible parts. Thus, if a to-be-discerned local feature is compatible, the top-down attention process may exploit the perceived hierarchical organization to descend easily from its global structure to this local feature. If it is not compatible—as is typical in embedded-figures tasks, for instance—the top-down attention process first is misled by the perceived global structure and then has to find a way around it.

Fig. 7. Embedded figures. At the top, a stimulus with a typically perceived organization comprising two triangular shapes, plus one of these easily discerned compatible parts and an incompatible diamond part that is less easily discerned. (After Kastens & Ishikawa, 2006)

Perception

Among the perceptual subprocesses in Fig. 6, the subprocess of feedforward extraction is reminiscent of the neuroscientific idea that, going up in the visual hierarchy, neural cells mediate detection of increasingly complex features (Hubel & Wiesel, 1968). Furthermore, the subprocess of recurrent selection is reminiscent of the connectionist idea that a standard PDP process of activation spreading in the brain’s neural network yields percepts represented by stable patterns of activation (Churchland, 1986). In structural coding, the combination of these two subprocesses is taken to be like a fountain under increasing water pressure: As the feedforward extraction progresses along ascending connections, each passed level in the visual hierarchy forms the starting point of integrative recurrent processing along descending connections. For a similar picture, see VanRullen and Thorpe (2002), and notice that this mechanism—just as Lee and Mumford’s (2003) particle filtering—yields a gradual buildup from percepts of parts at lower levels in the visual hierarchy to percepts of wholes near its top end.

By nature, this gradual buildup takes time, so, it leaves room for attention to intrude and to modulate things before a percept is complete. In this sense, structural coding does not exclude influences from higher cognitive levels entirely. However, it also assumes that the perceptual organization process is very fast and that, by then, it already has done much of its integrative work (cf. Gray, 1999; Pylyshyn, 1999). Structural coding attributes this speed to neuronal synchronization in the gamma band. Notice that 30–70-Hz gamma oscillations are faster than 8–30-Hz alpha and beta oscillations. The latter usually are associated with top-down processes, while gamma synchronization occurs predominantly in horizontal neural assemblies within visual areas, which have been associated with binding of similar features (Gilbert, 1992). The latter subprocess may be relatively underexposed in neuroscience, but it may well be the neuronal counterpart of the regularity extraction operations which, in representational coding approaches, are proposed to obtain structured mental representations of incoming visual information.

In fact, structural coding postulates that gamma synchronization mediates transparallel feature processing, which means that many similar features are hierarchically recoded in one go, that is, simultaneously as if only one feature were concerned (van der Helm, 2012, 2014, 2015a). There is no direct evidence that the brain indeed performs transparallel processing, but to my knowledge, it is the first computational proposal to do justice to the idea that neuronal synchronization must be a special form of neuro-cognitive processing. Moreover, this computational proposal is substantiated formally as follows.

Transparallel processing

In computing, transparallel processing corresponds to the extraordinarily powerful form of processing promised by quantum computers (see van der Helm, 2015a). Actually, it is already feasible on single-processor classical computers, and structural coding implemented it in PISA, which is a minimal coding algorithm for strings (van der Helm, 2004, 2015a). Notably, to compute guaranteed simplest codes of strings, PISA employs formal counterparts of the three perceptual subprocesses in Fig. 6. By exploiting visual regularities such as repetition and symmetry, such codes specify strings by a minimum number of descriptive parameters (for formal and empirical underpinnings of the choice of employed regularities, see van der Helm & Leeuwenberg, 1991, 1996, 1999, 2004). Notice that a string gives rise to a superexponential number of candidate codes (i.e., hypotheses), so that finding the simplest one is probably not tractable by traditional forms of processing.
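To give a flavor of minimal coding—a toy only: SIT’s actual coding language covers iteration, symmetry, and alternation, and PISA’s hyperstring machinery is far more involved—consider scoring descriptions of a string that may exploit repetition:

```python
def toy_complexity(s: str) -> int:
    """Descriptive parameters of the shortest description of s when a
    run of k identical chunks may be written as one chunk plus a
    repeat count (counted as chunk length + 1 parameters). Toy only."""
    best = len(s)  # literal description: one parameter per symbol
    for chunk_len in range(1, len(s) // 2 + 1):
        if len(s) % chunk_len == 0:
            chunk = s[:chunk_len]
            if chunk * (len(s) // chunk_len) == s:
                best = min(best, chunk_len + 1)
    return best

print(toy_complexity("abababab"))  # 3: "ab" repeated four times
print(toy_complexity("abacadae"))  # 8: no compressive repetition
```

Even this toy makes the combinatorics visible: the literal description always works, but regularities, where present, compress it.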

In PISA, this problem has been solved by employing special, usually sparse, distributed representations called hyperstrings. Hyperstrings are superpositions of up to an exponential number of similar regularities, which can be hierarchically recoded in a transparallel fashion, that is, simultaneously as if only one regularity were concerned (Footnote 7). For more details, see van der Helm (2012, 2014, 2015a), in which hyperstrings are taken as formal counterparts of transient neural assemblies, while transparallel processing is proposed to be the special form of neuro-cognitive processing mediated by synchronization in such neural assemblies. Notice that this is consistent with Lee and Mumford’s (2003) suggestion that particles might be represented by temporarily synchronized neural assemblies.

Strings do not, of course, constitute input like that of the human visual system. Nevertheless, the foregoing provides formal support for the idea that transparallel processing—mediated by gamma synchronization—might be the powerful form of neuro-cognitive processing needed to solve the superexponential inverse problem of perception.

Discussion

Neurophysiological evidence, on the one hand, links experimental conditions to brain activity but does, of itself, not indicate what this brain activity means in terms of cognitive information processing. Behavioral evidence, on the other hand, links experimental conditions to the outcome of cognitive information processing but does, of itself, not indicate how this outcome is arrived at. To dig deeper, one has to resort to performance models, or cognitive architectures as they are called in artificial intelligence research (Anderson, 1983; Newell, 1990). The architectures proposed by FE predictive coding and structural coding (see Figs. 5 and 6, respectively) are examples of such performance models. Clearly, both architectures still have to be elaborated further. Yet, it seems safe to say that FE predictive coding is ahead in its account of the neurophysiological side (see Clark, 2013), while structural coding is ahead regarding critical tests at the behavioral side (see Leeuwenberg & van der Helm, 2013) and regarding formal support for the proposed computational role of gamma synchronization (see van der Helm, 2012, 2014, 2015a).

Clarity about the role of gamma synchronization is particularly relevant to understand effects of impaired gamma synchronization, as found in neurodevelopmental disorders such as schizophrenia (Uhlhaas, Silverstein, & Phillips, 2005) and autism spectrum disorders (ASD) (Grice et al., 2001; Maxwell et al., 2015; Milne et al., 2009; Sun et al., 2012; Wright et al., 2012). For instance, within FE predictive coding, Clark (2013) suggested that impaired gamma synchronization leads to an imbalanced precision associated with prediction errors, that is, a higher precision at lower levels relative to that at higher levels (see also Kanai et al., 2015). Clark (2013) argued that this might explain hallucinations and delusions in schizophrenia (cf. Fletcher & Frith, 2009), but see also Silverstein (2013) who argued that these symptoms are more likely to arise at higher cognitive levels.

Furthermore, Lawson, Rees, and Friston (2014) argued that the imbalanced-precision-by-impaired-gamma-synchronization idea explains various perceptual and social-exchange symptoms in ASD. However, without referring to gamma synchronization, Van de Cruys et al. (2014) argued that those symptoms can be explained by a high, inflexible precision of prediction errors at both lower and higher levels. As a consequence, Van de Cruys et al. argued, ASD individuals put more value on small errors than typical individuals do. Whatever the relation between precision and gamma synchronization may be, putting more value on small errors agrees with findings that ASD individuals tend to focus more on local information in visual stimuli than on global information. For instance, they tend to categorize things into smaller categories (see, e.g., Klinger & Dawson, 2001; Newell et al., 2010) and are better at discerning embedded figures like the diamond in Fig. 7 (Jolliffe & Baron-Cohen, 1997; Shah & Frith, 1983, 1993).

In structural coding, the proposed role of gamma synchronization has nothing to do with prediction errors or their precision and is not, as in FE predictive coding, associated post hoc with empirical data. It is based on formal computational grounds and implies that gamma synchronization subserves the integration of local features into global structures (see section “Transparallel processing”). By this account, impaired gamma synchronization leads to less developed global structures. Such reduced perceptual integration would affect classification abilities and, thereby, generalization and learning abilities (as seems to be the case in schizophrenia; see Doody et al., 1998). By the same token, it would result in categorization into smaller categories. Furthermore, linking up with section “Attention”, it would also result in weaker masking effects on embedded figures (i.e., local features that are incompatible with typically perceived global structures), which therefore would be better discernible (van der Helm, 2015b). The latter agrees with the weak central coherence theory of ASD (Frith, 1989; see also Happé & Booth, 2008).

In other words, structural coding holds that, depending on the severity of the disorder, ASD individuals are left with something between incoming pieces of visual information and typically perceived wholes (think of an unfinished jigsaw puzzle). Then, top-down attention hardly has anything global to focus on, so, it naturally exhibits a narrowed focus and its access to embedded figures is hindered less by global structures. As van der Helm (2015b) argued, this also means that structural coding predicts that typical individuals are not worse than ASD individuals in discerning parts that are compatible with typically perceived global structures (like the triangular parts in Fig. 7)—simply because, in typical individuals, compatible parts are not masked by perceived global structures (see section “Attention”). As far as I can tell, this is not what FE predictive coding would predict, so, future tests of this prediction may prove to be critical.

General discussion

FE predictive coding and structural coding both use free-energy minimization as a metaphor for processing in the brain, but their elaborations of this metaphor are fundamentally different. FE predictive coding relies on classical information theory to minimize prediction errors, using probabilities to be tuned via model fitting. Structural coding relies on modern information theory to minimize the information load of predictions, using fairly stable descriptive complexities. I am admittedly biased towards structural coding, but in this article, I have tried to make a fair assessment of FE predictive coding.

To be frank, I found it hard to deconstruct. For instance, in section “FE predictive coding’s cognitive architecture”, I indicated that template matching was abandoned long ago in human vision research and that I do not see how FE predictive coding’s glorified version might turn the tables. Furthermore, its sometimes grandiloquent statements often seem to capitalize on intuitive associations in readers. One example thereof is its usage of the association-laden term surprise instead of the formal term surprisal from classical information theory. Another example is Bastos et al.’s (2012) “through selecting appropriate sensations, the brain is implicitly maximizing the evidence for its own existence” (p. 702; see also Friston, 2010). To me, the last part is esoterism, and I would not say that the brain selects appropriate sensations. I would simply say instead that, through action, it can select different vantage points but will have to make do with whatever sensations it gets. In this sense, as indicated in section “Bayesian inference and the role of action in perception”, I think that FE predictive coding exaggerates the role of action in perception.

Be that as it may, notice that I sympathize with the more general Bayesian brain idea—albeit that I make a clear functional distinction between perception and higher cognitive levels. For instance, I can appreciate that—to increase practical utility—one might want to include knowledge (e.g., about the environment) in machine vision systems. However, I think that—due to transparallel processing—the human perceptual organization process is so fast that it hardly leaves room for effects of such knowledge, and that such apparent effects rather reflect post-perceptual enrichment. I therefore think that knowledge-based Bayesian approaches might be suited to model inferences at higher cognitive levels, but that perceptual inferences rather are guided by the Occamian simplicity belief working on data to construct, on the fly, hypotheses about these data.

Structural coding pursues the latter, accomplishing much of what FE predictive coding aims to accomplish—including links from perception to attention and action. Structural coding needs further elaboration, particularly at the neurophysiological side. Yet, as discussed, it is basically a parameter-free approach, which, by its simplicity principle, gives a principled account of priors and conditionals, providing fairly optimal encoding of data and fairly veridical perception in daily life. To this end, it relies on minimal coding of the internal structure of individual messages, which seems an appropriate reflection of the way in which the brain might encode sensory information in an efficient and parsimonious fashion. Furthermore, it substantiates that transparallel processing—mediated by gamma synchronization—might be the form of neuro-cognitive processing that solves the inverse problem of perception by way of a flexible, self-organizing, cognitive architecture implemented in the relatively rigid neural architecture of the brain.

In structural coding’s minimal-coding algorithm PISA, transparallel processing is enabled by hyperstrings, which are distributed representations built on the fly by the subprocess of horizontal feature binding and operated on by the subprocess of recurrent feature selection. By these data structures, structural coding links up with network models (see van der Helm, 2012). Furthermore, by converting descriptive complexities into precisals, structural coding might be given a Bayesian formulation that, in various respects, would resemble Lee and Mumford’s (2003) model. In other words, structural coding can be said to represent a separate branch of the diverse family of predictive coding models.

Finally, my critique of FE predictive coding should not obscure that I do appreciate that it—just as other predictive coding approaches and just as structural coding—aims to unify ideas about competence and performance. The distinction between these two notions corresponds to the distinction between what Wertheimer called the molar and molecular levels (see Koffka, 1935) or what Marr (1982/2010) called the “what” and “how” questions. As Marr noted, answering these questions may be totally different endeavors, but answers to both questions are needed for a full understanding.