Skip to main content

Self-supervision, normativity and the free energy principle

Abstract

The free energy principle says that any self-organising system that is at nonequilibrium steady-state with its environment must minimize its (variational) free energy. It is proposed as a grand unifying principle for cognitive science and biology. The principle can appear cryptic, esoteric, too ambitious, and unfalsifiable—suggesting it would be best to suspend any belief in the principle, and instead focus on individual, more concrete and falsifiable ‘process theories’ for particular biological processes and phenomena like perception, decision and action. Here, I explain the free energy principle, and I argue that it is best understood as offering a conceptual and mathematical analysis of the concept of existence of self-organising systems. This analysis offers a new type of solution to long-standing problems in neurobiology, cognitive science, machine learning and philosophy concerning the possibility of normatively constrained, self-supervised learning and inference. The principle can therefore uniquely serve as a regulatory principle for process theories, to ensure that process theories conforming to it enable self-supervision. This is, at least for those who believe self-supervision is a foundational explanatory task, good reason to believe the free energy principle.

This is a preview of subscription content, access via your institution.

Fig. 1

Notes

  1. In machine learning, there is a recent focus on self-supervision as a crucial subtype of unsupervised learning, that is, learning that does not require labelled training data. In this research area, self-supervised learning proceeds by withholding some of the data and then letting the system predict it (cf. contextual cues); in this way data supervise the learning (such that self-supervision, in some sense, is a type of supervised learning too). In this paper, the notion of self-supervision is used in a quite general and generic sense, which captures the autonomy of self-supervision and its reliance on an internal model of causes of sensory input; it is intended to capture ‘truly’ unsupervised learning, namely where the system only relies on itself and its own exploration for normative constraints on learning and inference (this is pursued in Sects. 4, 5 below); the notion of unsupervised learning goes back to at least Barlow (e.g., Barlow 1989) and is now a textbook topic in machine learning.

  2. There might of course be other concepts of the existence of biological organisms. The current approach focuses on the idea of nonequilibrium steady state, which immediately suggests a statistical analysis. There is a long history connecting existence of biological systems with nonequilibrium steady state (or far from equilibrium states); (see, e.g., Ashby 1947; Nicolis and Prigogine 1977; Prigogine and Nicolis 1971; Schrödinger 1944; Von Bertalanffy 1950). More philosophically-oriented discussion can be found in Mark Bickhard’s work, such as (Bickhard 2009), which draws on the dynamical systems approach to discuss both the nature of representation and topics close to FEP; there is an interesting project in exploring their affinities (see also Bickhard 2016). There are non-equilibrium steady state systems that are not biological in the common sense, such as non-biological adaptive systems, and perhaps phenomena such as tornadoes. There is debate about this issue of the scope of FEP; (e.g., Sims 2016). For this paper, I set aside a full discussion of scope issues and focus on the kinds of systems for which self-supervision and normativity are commonly discussed.

  3. Surprise is also known as surprisal and corresponds to Shannon's self-information: the negative log probability of some system states, conditioned that system or model. The average self-information of a system is entropy; suggesting that avoiding surprises places an upper bound on the entropy of a system’s states.

  4. There is substantial debate about the very notion of conceptual analysis and the a priori in philosophy. I do not here rely on any particular approach but merely on a basic sense of conceptual analysis as given by grasp and elucidation of concepts, and the way in which that is a priori at least in the sense of not immediately requiring empirical investigation. It may be that our conceptual analysis is susceptible to empirical evidence, and this can lead to considerations about whether the initial analysis failed, or whether the concept has changed; it may also be that there are different conceptual analyses of the same terms (e.g., in different cultures), which in turn raises questions whether this is a case of different understandings of the same concepts or of different concepts (much of this discussion plays out in Frank Jackson’s defence of conceptual analysis and the debate following Jackson (1998)). My argument here is subject to the eventual fate of these questions, together with other purported a priori conceptual analyses in philosophy.

  5. The recognition model is sometimes described as a recognition density; namely, an approximate posterior probability density over unknown (external) states of the world causing sensory states. This model or density can be considered a Bayesian belief about something; namely, unknown states of the world (note that Bayesian beliefs are not propositional beliefs). Note also a distinction I set aside here: optimising a recognition density is different from optimising a model. In statistics, this is the difference between Bayesian model inversion—to produce an approximate posterior over unknown causes, given a model—and Bayesian model selection—0to produce an approximate posterior over competing models.

  6. A functional is a function of a function. The free energy is a functional because it is a function of a probability density; namely, q, which is a probability density function of external states.

  7. The implicit conversion of an intractable integration problem into a tractable optimisation problem is due to Richard Feynman, who introduced the notion of variational free energy via the path integral formulation of quantum electrodynamics. It was subsequently exploited in machine learning, where minimising variational free energy is formally synonymous with approximate Bayesian inference.

  8. Notice that the problem of self-supervised learning and inference relates to the general problem of learning where any learning system seeking to estimate a data generating process would have to minimize risk (expected value of some loss function). The problem is that the relevant joint probability distributions are unknown. So, the system has to minimize some proxy (e.g., empirical risk; Vapnik 1995). The FEP uses a proxy as well, minimizing (expected) free energy, to minimise surprise.

  9. There are various complexity measures in the literature. An influential approach belongs with the Akaike Information Criterion (Akaike 1974). Under FEP, complexity is conceived as the KL divergence between the prior and posterior distributions; a smaller divergence indicates that less complexity was introduced to account for new observations.

  10. This is not to suggest that one should believe FEP merely because it is in some sense ‘mathematical’ (though there is perhaps a sense in which mathematical proof should be believed). Rather the point is that, when investigating what reasons we have for believing FEP, we should look for mathematical (and conceptual) reasons, not empirical evidence.

  11. Hamilton’s principle states that a mechanical system develops in time such that the integral of the difference between kinetic and potential energy is stationary. There is some debate about the epistemic status of Hamilton’s principle (see, e.g., Smart and Thébault 2015; Stöltzner 2009); in unpublished work the latter authors have argued that a Humean about laws can place Hamilton’s principle as the most fundamental law, making it essentially empirical rather than a priori). I think most standard descriptions are consistent with the reading that it is not a law of nature in the usual sense but a mathematical principle for understanding the dynamics of a physical system in terms of a variational problem, given information about the system and the forces acting on it. My appeal to Hamilton’s principle is not intended to establish complete parity between it and FEP; it may very well be that the former is not driven by conceptual analysis in the way I have argued FEP is. I appeal to the principle here to indicate that there is precedence in science for considering something a principle, which systems may or may not conform with, rather than a law. Another question for further discussion is whether Newton’s laws stand to Hamilton’s principle as process theories like predictive coding stand to FEP. Note finally that there is deep affinity between FEP and Hamilton’s principle, such that FEP applies to systems for which Hamilton’s Principle holds, hence, if the latter is not a priori then arguably FEP would not be either; for discussion of fundamental physics and FEP, see Friston (2019).

  12. Heuristically, if the underlying distribution is multimodal (i.e., non-normal, or not gaussian), then predictive coding can mischaracterise a given sample, which is close to one peak, as a large prediction error relative to another peak; for discussion, see the Hierarchical Gaussian Filter developed in Mathys et al. (2014). For discussion of how this relates to empirical research on particular systems, such as human brains, see Friston (2009). In terms of evidence, Friston remarks that “there is no electrophysiological or psychophysical evidence to suggest that the brain can encode multimodal approximations: indeed, with ambiguous figures, the fact that percepts are bistable (as opposed to bimodal and stable) suggests the recognition density is unimodal” (2009: p. 298).

  13. Here the question of the scope of FEP is relevant. FEP is so general that it may apply to systems like single fat cells, which would not be sharing cognitive architecture with humans. The question what FEP implies for such systems then depends on the assumptions made for them. For FEP applied to really basic model systems, see Friston (2013). Interesting questions arise about the meaning of key notions, such as ‘inference’, ‘representation’, and ‘computation’ and how far and in what manner they might deviate from their literal senses, associated with symbolic representation etc. It seems likely that to get literal inference/computation/representation, we need to appeal to some subset of process theories and assumptions of particular systems of FEP, such as those arguably applying to humans. The thrust of my argument in this paper, to be unfolded in the next section, is that FEP entails approximation to Bayesian inference, and therefore a sense of normativity that seems relevant for what might be regarded basic notions of representation and misrepresentation (at least in the sense of genuine norm-violation). I think it is a substantial further discussion how far and in what way the notions ‘inference’ and ‘representation’ used here deviate from literal (e.g., symbolic) inference and representation. My view is that, in so far as FEP ensures normativity, the use of those notions is justified to all systems for which FEP applies, since we often cash out those notions precisely in terms of normativity. I do also think it is likely that FEP will eventually lead to a recalibration of what we mean by ‘inference’ and ‘representation’. However, the issues here are substantial and some aspects will need to be developed in subsequent research.

  14. I am here using these terms in a fairly generic, philosophical sense. In the fields of cognitive science, machine learning, and statistical learning, there is substantial treatment of the issue of supervision, using somewhat different understandings of the notion of supervision. In machine learning approaches, there are many unsupervised algorithms and many things that organisms do that involve supervised learning (including supervision by nature). In philosophy, supervised learning raises foundational problems of normativity, essentially related to the learner’s grasp of the labels, which must be considered before supervised learning can truly be understood. I am here implying that self-supervised learning is, or should be, a (or perhaps the) quest of machine learning. Of course, valuable machine learning advances can come from devising robust supervised learning algorithms, but from a philosophical perspective, machine learning will only throw light on human intelligence (or approximate human intelligence) if it begins from a basis of self-supervision. This claim is based on the observation, versions of which stretch all the way back to Kant and beyond, that human intelligence must come about just by relying on sensory input and prior belief. See also footnote 1 for comments on my use of the notion of self-supervision.

  15. Here and elsewhere in the paper, various typically personal-level terms are used (‘know’, ‘believe’, ‘evidence’ etc.). This is not to imply that these are personal level rather than subpersonal level processes or states. I am agnostic on how to draw that boundary and here simply use these terms more or less like they are used in the wider literature and in textbooks on machine learning and statistical learning. ‘Knowing’ is the appropriate notion to use here because the reasoning behind FEP leads to the idea that the model is inferred (and what a system infers it in some sense knows). There is a substantial, different debate to be had about the sense in which ‘approximate inference’ is ‘inference’, related to these issues, which is beyond the scope of this paper.

  16. I am not here providing a foundational defence of Bayesianism as such; I am relying on the fact that FEP implies an approximation to the exact Bayesian posterior, which is a good candidate for being a paradigm of normativity. For discussion of Bayesian optimality, see Rahnev and Denison (2018).

  17. There have been previous suggestions linking self-organisation and self-supervision. Ashby, for example, argued that systems can be both self-organised and also display determinate behaviour (Ashby 1954). The current proposal makes this link via the notion of self-evidencing inherent in FEP.

  18. In this paper, the focus is on self-organising systems, which is what FEP is formulated for. Such systems can act to maintain themselves in their expected states. There are some very substantial questions about where to put the boundary between self-organised systems in this sense and any other system in the broadest sense (e.g., mechanical systems, or any system indeed that physics can describe). In some iterations, the notion of free energy minimization is so general that it literally applies to every thing (Friston 2019); (see also fn 11). In other words, something is needed to distinguish mere causal mechanisms from self-organising biological organisms (and to distinguish between things and non-things). One distinction is between things that can model expected free energy and infer policies on this basis in order to engage in active self-evidencing, and things that cannot. Mechanical systems cannot do this, if their action repertoire is pre-set (by a designer who has performed the active inference for the mechanism, for example; for self-supervised artificial intelligence mechanisms, the discussion veers into the substance of this paper). Hence, the question how the system can minimize surprise, if it can’t know its model a priori, pertains to self-organising systems conceived in this way. Further research is needed to fully discuss the question what if anything FEP and kindred approaches imply about non-self-organising systems. In particular, there is the question what, if anything, the notions of perceptual and active inference come to in these kinds of cases where it is less clear that they apply; for example, a projectile is described by Hamilton’s principle but does not, in any sense of the word, “compute” its stationary points of action.

  19. These comments may need some clarification, to set them in the context of computational neuroscience and machine learning approaches. The argument here is not that, a priori, all ‘bottom-up’ approaches fall short of converging on Bayes’ rule; and there are bottom-up learning methods (from single-layer perceptrons onwards) for which there are convergence proofs. Rather, the argument is that some of these bottom-up methods rest on supervised learning (e.g., perceptrons), which raises the problem of normativity in focus here, and suggesting they do not conform to FEP. If the bottom-up method does in fact conform to FEP, then it can potentially form a process theory (subject to assumptions and anatomical plausibility) for which the problem of normativity will not arise, and thereby a good starting point for the quest for truly self-supervised learning. In this light, convergence does not suffice for normativity because the underlying process should also be self-supervised in a manner that does not evoke the problem of normativity; conformity to FEP demonstrates that both constraints are satisfied. When it comes to the ‘top-down’ approaches more commonly endorsed by FEP, the claim is that they are normative in the sense of converging on Bayes; a good place to explore the mathematical grounds for this claim is in discussion of the Hierarchical Gaussian Filter (Mathys et al. 2011) that sets precision-weighted prediction error minimization in the context of approximate inference and reinforcement learning (with a dynamic learning rate).

  20. The time scale over which free energy is assessed is important. Here I just consider it the appropriate time scale for the organism in question but there is a substantial further question to address here. The critical point here is that approximation is not instantaneous.

  21. An interesting question is whether this approach will outlaw “weird” concepts like, indeed, a regular concept dog-or-sheep, which itself has satisfaction conditions. FEP could allow such concepts if they had conceivable free energy minimization properties. If not, and if such concepts are regarded as bona fide concepts nevertheless, then FEP would only be a partial solution. The example given here is semantic but the account extends to action, that is, the problem of inferring a specific policy, for example, for avoiding an environment that is too cold. Active inference sets out how expected free energy is minimised, given an internal model of the environment’s states (including the agent’s own states). That is, a precise policy for action (conceived as a series of control states) is inferred, which is expected to maintain the system in its expected (not too cold) states (e.g., putting on a coat). The system can rank and execute specific policies but only given an internal model governed by FEP, which cashes out normativity by allowing a KL-divergence to be minimised between states achievable given a policy and states the system expects to occupy. Policies in active inference (for expected surprise) are then like priors in perceptual inference (for actual surprise). A policy is inferred by assessing the surprise expected under different policies. On the basis of the model’s inferred policy, an expectation of sensory input is generated, which is minimized through action. As such, policies are part of the generative model and help specify the expected states of the system. The cost function itself is absorbed into the priors, and policies can also be updated and there can be model selection (based on complexity considerations/Bayesian model evidence). In this sense, just as a system’s internal model can have a fine-grained set of priors that describe its beliefs about the world, it can have a fine-grained set of policies that describe beliefs about how it acts in the world. Active inference thus furnishes an answer to challenges about decision-making and inference of specific policies, such as (Klein 2016).

  22. There is interesting discussion from the dynamical systems perspective, but kindred to FEP topics, in Bickhard (2009).

  23. An interesting question arises here about artificial systems running active inference algorithms but which are not easily seen as self-organised or autonomous systems: do they display normativity? A full answer is beyond the scope of this paper partly because it touches on issues around emulation, which may undermine true self-supervision. It may be that some artificial systems can be considered truly normative, in the FEP sense, and therefore self-organising.

  24. Notice that here care is taken not to imply that FEP itself implies cognitive architecture. Notions of architecture will need to build on assumptions about the particular system in question, which will constrain processes for message passing structure. It is a topic for further discussion how assumptions play this role, and what assumptions may look like in various non-human systems. This invokes a larger issue in philosophy of science concerning the ways in which principles constrain process theories (or laws). For FEP, the starting point for this issue is the idea of the addition of assumptions about particular systems to the principle. However, further discussion is needed of what this exactly means: is the relation something on a spectrum between derivation (given assumptions) and more informal notions of coherence, for example? There is a significant body of literature on these questions, which is still to make contact with the specific status of principles versus process theories (see, e.g., Craver 2005; Zednik and Jäkel 2016).

  25. There are many proposals for unsupervised learning, which are not designed to conform to FEP (e.g., Zheng et al. 2018). The claim is not that such approaches cannot deliver what they promise. It may be that they are in fact conforming to FEP, or they may establish normativity in their own right (perhaps conforming to some other principle). Here, the focus is FEP and the claim is that it has a philosophically appealing approach to normativity in terms of self-supervised (as that term is used here).

References

Download references

Acknowledgements

Thank you to Stephen Gadsby, Karl Friston and Thomas Parr for comments and suggestions on earlier versions; and to anonymous reviewers for several suggestions. Thank you for comments and suggestions to participants in workshops at Macquarie University and University of Cambridge, including Dan Williams who organised the latter, where some of this material was presented. Thank you to the members of the Cognition and Philosophy Lab for comments. This research is supported by the Australian Research Council DP190101805.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jakob Hohwy.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Hohwy, J. Self-supervision, normativity and the free energy principle. Synthese 199, 29–53 (2021). https://doi.org/10.1007/s11229-020-02622-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11229-020-02622-2

Keywords

  • Self-organisation
  • Self-evidencing
  • Self-supervision
  • Unsupervised learning
  • Free energy principle
  • Normativity
  • Active inference
  • Principles
  • Predictive processing
  • Process theories