## Abstract

A central concept in active inference is that the internal states of a physical system parametrise probability measures over states of the external world. These can be seen as an agent’s beliefs, expressed as a Bayesian prior or posterior. Here we begin the development of a general theory that would tell us when it is appropriate to interpret states as representing beliefs in this way. We focus on the case in which a system can be interpreted as performing either Bayesian filtering or Bayesian inference. We provide formal definitions of what it means for such an interpretation to exist, using techniques from category theory.

### Keywords

- Bayesian filtering
- Bayesian inference
- Category theory

## Notes

1. We leave open the possibility that they could be distinguished by looking at some broader context, e.g. by discovering that a device’s designer intended a particular interpretation, or that evolution selected for a particular interpretation.

2. In fact for most of the paper we will work much more abstractly than this. It would be more correct to say “objects in a Markov category” wherever we say “measurable space” and “morphisms in a Markov category” wherever we say “Markov kernel,” since for most of the paper we will reason at the category level, and we will not directly invoke the definition of a measurable space. We have chosen to use the more concrete terms because they express a clear intuition for how these objects and morphisms are intended to be interpreted.

3. Or, in a more general context, the unit object of a monoidal category.

4. In category theory terms, this means that the set of all delete kernels collectively forms a natural transformation. (Specifically, it is a natural transformation from the identity functor to the functor that sends all objects to \(\mathbbm {1}\) and all morphisms to \(\mathsf {id}_{\mathbbm {1}}\).) For this reason this property of delete kernels is called “naturality.”

## References

1. Aguilera, M., Millidge, B., Tschantz, A., Buckley, C.L.: How particular is the physics of the Free Energy Principle? arXiv:2105.11203 (2021). http://arxiv.org/abs/2105.11203

2. Albantakis, L., Massari, F., Beheler-Amass, M., Tononi, G.: A macro agent and its actions. arXiv:2004.00058 (2020). http://arxiv.org/abs/2004.00058

3. Ay, N., Löhr, W.: The Umwelt of an embodied agent-a measure-theoretic definition. Theory Biosci. **134**(3–4), 105–116 (2015). https://doi.org/10.1007/s12064-015-0217-3

4. Baez, J., Stay, M.: Physics, topology, logic and computation: a rosetta stone. In: Coecke, B. (ed.) New Structures for Physics, Lecture Notes in Physics, pp. 95–172. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-12821-9_2

5. Beer, R.D.: Autopoiesis and cognition in the game of life. Artif. Life **10**(3), 309–326 (2004). https://doi.org/10.1162/1064546041255539

6. Beer, R.D.: The cognitive domain of a glider in the game of life. Artif. Life **20**(2), 183–206 (2014). https://doi.org/10.1162/ARTL_a_00125

7. Biehl, M., Ikegami, T., Polani, D.: Towards information based spatiotemporal patterns as a foundation for agent representation in dynamical systems. In: Proceedings of the Artificial Life Conference 2016, pp. 722–729. The MIT Press (2016). https://doi.org/10.7551/978-0-262-33936-0-ch115, https://mitpress.mit.edu/sites/default/files/titles/content/conf/alife16/ch115.html

8. Biehl, M., Kanai, R.: Dynamics of a Bayesian hyperparameter in a Markov chain. In: IWAI 2020. CCIS, vol. 1326, pp. 35–41. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-64919-7_5

9. Biehl, M., Pollock, F.A., Kanai, R.: A technical critique of some parts of the free energy principle. Entropy **23**(3), 293 (2021). https://doi.org/10.3390/e23030293, https://www.mdpi.com/1099-4300/23/3/293

10. Bolt, J., Hedges, J., Zahn, P.: Bayesian open games. arXiv:1910.03656 (2019). http://arxiv.org/abs/1910.03656

11. Capucci, M., Gavranović, B., Hedges, J., Rischel, E.F.: Towards foundations of categorical cybernetics. arXiv:2105.06332 (2021). http://arxiv.org/abs/2105.06332

12. Capucci, M., Ghani, N., Ledent, J., Forsberg, F.N.: Translating Extensive Form Games to Open Games with Agency. arXiv:2105.06763 (2021). http://arxiv.org/abs/2105.06763

13. Cho, K., Jacobs, B.: Disintegration and Bayesian inversion via string diagrams. Math. Struct. Comput. Sci. **29**(7), 938–971 (2019). https://doi.org/10.1017/S0960129518000488, http://arxiv.org/abs/1709.00322

14. Coecke, B., Paquette, É.: Categories for the practising physicist. In: Coecke, B. (ed.) New Structures for Physics, Lecture Notes in Physics, pp. 173–286. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-12821-9_3

15. Coecke, B., Kissinger, A.: Picturing Quantum Processes: A First Course in Quantum Theory and Diagrammatic Reasoning. Cambridge University Press, Cambridge (2017)

16. Da Costa, L., Friston, K., Heins, C., Pavliotis, G.A.: Bayesian Mechanics for Stationary Processes. arXiv:2106.13830 [math-ph, physics:nlin, q-bio] (2021). http://arxiv.org/abs/2106.13830

17. Dennett, D.C.: True believers: the intentional strategy and why it works. In: Heath, A.F. (ed.) Scientific Explanation: Papers Based on Herbert Spencer Lectures Given in the University of Oxford, pp. 53–75. Clarendon Press (1981)

18. Fong, B., Spivak, D.I.: An Invitation to Applied Category Theory: Seven Sketches in Compositionality. Cambridge University Press, Cambridge (2019)

19. Friston, K.: A free energy principle for a particular physics. arXiv:1906.10184 [q-bio] (2019). http://arxiv.org/abs/1906.10184

20. Friston, K., Da Costa, L., Hafner, D., Hesp, C., Parr, T.: Sophisticated Inference. Neural Comput. **33**(3), 713–763 (2021). https://doi.org/10.1162/neco_a_01351

21. Friston, K.J., Da Costa, L., Parr, T.: Some interesting observations on the free energy principle. Entropy **23**(8), 1076 (2021). https://doi.org/10.3390/e23081076, https://www.mdpi.com/1099-4300/23/8/1076

22. Fritz, T.: A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics. Adv. Math. **370**, 107239 (2020). https://doi.org/10.1016/j.aim.2020.107239, https://www.sciencedirect.com/science/article/pii/S0001870820302656

23. Jacobs, B.: A channel-based perspective on conjugate priors. Math. Struct. Comput. Sci. **30**(1), 44–61 (2020). https://doi.org/10.1017/S0960129519000082, https://www.cambridge.org/core/journals/mathematical-structures-in-computer-science/article/channelbased-perspective-on-conjugate-priors/D7897ABA1AA06E5F586F60CB21BDDB32

24. Jacobs, B.: A Channel-Based Perspective on Conjugate Priors. arXiv:1707.00269 (2018). http://arxiv.org/abs/1707.00269

25. Jacobs, B., Staton, S.: De Finetti’s construction as a categorical limit. In: Petrişan, D., Rot, J. (eds.) Coalgebraic Methods in Computer Science, Lecture Notes in Computer Science, pp. 90–111. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57201-3_6

26. Knill, D.C., Pouget, A.: The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. **27**(12), 712–719 (2004). https://doi.org/10.1016/j.tins.2004.10.007, https://www.cell.com/trends/neurosciences/abstract/S0166-2236(04)00335-2

27. Kolchinsky, A., Wolpert, D.H.: Semantic information, autonomous agency and non-equilibrium statistical physics. Interface Focus **8**(6), 20180041 (2018). https://doi.org/10.1098/rsfs.2018.0041, https://royalsocietypublishing.org/doi/full/10.1098/rsfs.2018.0041

28. Krakauer, D., Bertschinger, N., Olbrich, E., Flack, J.C., Ay, N.: The information theory of individuality. Theory Biosci. **139**(2), 209–223 (2020). https://doi.org/10.1007/s12064-020-00313-7

29. Libby, E., Perkins, T.J., Swain, P.S.: Noisy information processing through transcriptional regulation. Proc. Natl. Acad. Sci. **104**(17), 7151–7156 (2007)

30. Ma, W.J., Jazayeri, M.: Neural coding of uncertainty and probability. Ann. Rev. Neurosci. **37**, 205–220 (2014). https://doi.org/10.1146/annurev-neuro-071013-014017

31. McGregor, S.: The Bayesian stance: equations for ‘as-if’ sensorimotor agency. Adapt. Behav., 105971231770050 (2017). https://doi.org/10.1177/1059712317700501, http://journals.sagepub.com/doi/10.1177/1059712317700501

32. Nakamura, K., Kobayashi, T.J.: Connection between the bacterial chemotactic network and optimal filtering. Phys. Rev. Lett. **126**(12), 128102 (2021). https://doi.org/10.1103/PhysRevLett.126.128102, https://link.aps.org/doi/10.1103/PhysRevLett.126.128102

33. Orseau, L., McGill, S.M., Legg, S.: Agents and Devices: A Relative Definition of Agency. arXiv:1805.12387 (2018). http://arxiv.org/abs/1805.12387

34. Parr, T., Da Costa, L., Friston, K.: Markov blankets, information geometry and stochastic thermodynamics. Phil. Trans. Roy. Soc. A Math. Phys. Eng. Sci. **378**(2164), 20190159 (2020). https://doi.org/10.1098/rsta.2019.0159, https://royalsocietypublishing.org/doi/full/10.1098/rsta.2019.0159

35. Risken, H., Frank, T.: The Fokker-Planck Equation: Methods of Solution and Applications. In: Springer Series in Synergetics, 2 edn. Springer-Verlag, Heidelberg (1996). https://doi.org/10.1007/978-3-642-61544-3, https://www.springer.com/gp/book/9783540615309

36. Rosas, F.E., Mediano, P.A.M., Biehl, M., Chandaria, S., Polani, D.: Causal blankets: theory and algorithmic framework. In: IWAI 2020. CCIS, vol. 1326, pp. 187–198. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-64919-7_19

37. Smithe, T.S.C.: Bayesian Updates Compose Optically. arXiv:2006.01631 (2020). http://arxiv.org/abs/2006.01631

38. St Clere Smithe, T.: Cyber kittens, or some first steps towards categorical cybernetics. Electron. Proc. Theor. Comput. Sci. **333**, 108–124 (2021). https://doi.org/10.4204/EPTCS.333.8, http://arxiv.org/abs/2101.10483v1

39. Still, S., Sivak, D.A., Bell, A.J., Crooks, G.E.: The thermodynamics of prediction. Phys. Rev. Lett. **109**, 120604 (2012). arXiv:1203.3271. http://arxiv.org/abs/1203.3271

40. Wikipedia contributors: Conjugate prior – Wikipedia, the free encyclopedia (2021). https://en.wikipedia.org/w/index.php?title=Conjugate_prior&oldid=1030202570, Accessed 8 July 2021

## Acknowledgements

The work by Martin Biehl on this publication was made possible through the support of a grant from Templeton World Charity Foundation, Inc. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of Templeton World Charity Foundation, Inc. Martin Biehl is also funded by the Japan Science and Technology Agency (JST) CREST project.

## Appendices

### A Category-Theoretic Probability and String Diagrams

In this paper we use some concepts from category-theoretic probability, and in particular we use a notation known as string diagrams. A full introduction to these topics would be out of scope of the paper, but we include here an informal introduction to the topic. We do this because, to our knowledge, no concise introduction currently exists that is focused on (classical) probability and does not assume a background in category theory. We assume that the reader knows the definition of a category, but not much more than that.

Appendix A.1 introduces the basic concepts, mostly in the context of discrete probability. In Appendix A.2 we briefly comment on how this extends to the general case of measure-theoretic probability with very little extra work. In Appendix A.3 we explain how to reason about conditional probabilities and Bayes’ theorem within this category-theoretic context.

These sections contain no original material. Their purpose is to give the reader enough information to be able to read the string diagram equations in the main text and later sections of the Appendix without needing to consult a category theory text. However, they are intended neither as an authoritative technical reference nor as a comprehensive review, and readers should consult the cited references for full details.

### A.1 Introduction to String Diagrams and Category-Theoretic Probability

A full technical introduction to the use of string diagrams in probability can be found in [22] or the earlier [13], but these works require some knowledge of category theory. The string diagram notation predates its use in probability and has many other applications. One could consult [4, 14, 15, 18] for tutorial introductions to diagrammatic reasoning in other fields, of various different flavours. Here we present it somewhat informally and only in the context of probability.

It should be kept in mind that, despite our somewhat informal introduction, string diagrams are formal expressions. The main difference between them and the more familiar kind of mathematical expression formed from strings of symbols is their two-dimensional syntax. This makes it easier to express certain concepts. (Particularly those relating to joint distributions, in the case of probability.)

We use the so-called *Markov category* approach to probability [22]. The main idea here is to express everything in terms of *measurable spaces* and *Markov kernels*, whose definitions we outlined in the main text.^{Footnote 2} To explain how the framework works, let us consider the special case where the only measurable spaces we are interested in are finite sets (with their power sets as their \(\sigma \)-algebras). If *A* and *B* are finite sets then a Markov kernel can be thought of as just a function \(f:A\rightarrow P(B)\), where *P*(*B*) is the set of all probability distributions over *B*. (The set *P*(*B*) may be thought of as a \((|B|-1)\)-dimensional simplex, consisting of all those vectors in \(\mathbb {R}^{|B|}\) whose components are all non-negative and sum to 1.) Such a function amounts to a |*B*|-by-|*A*| stochastic matrix, although some care needs to be taken over which rows correspond to which elements of *B* and which columns to which elements of *A*.

In this finite case, we write \(f(b\,\|\,a)\) to denote the probability that the kernel *f* assigns to the outcome \(b\in B\) when given the input \(a\in A\). We use a thick vertical line to indicate a close relationship to conditional probability while also emphasising that the concept is different: given a kernel \(f:A\rightarrow P(B)\) the quantities \(f(b\,\|\,a)\) are always defined, regardless of whether any probability distribution has been defined over *A*, and regardless of whether *a* has a nonzero probability according to such a distribution. More common notations include | or ; in place of \(\|\).

We also write *f*(*a*) for the probability distribution over *B* that the function *f* returns when given the input *a*. We could say that \(f(b\,\|\,a)\) is defined as *f*(*a*)(*b*).

Given Markov kernels \(f:A\rightarrow P(B)\) and \(g:B\rightarrow P(C)\), we can compose them to form a new kernel of type \(A\rightarrow P(C)\). We write this \(f\fatsemi g\). It is given by

\[(f\fatsemi g)(c\,\|\,a) = \sum _{b\in B} f(b\,\|\,a)\,g(c\,\|\,b). \tag{8}\]

In this finite case this is simply matrix multiplication, and we could have denoted it *gf* instead of \(f\fatsemi g\) accordingly. (Another common notation is \(g\circ f\).) We prefer \(f\fatsemi g\) because it puts *f* and *g* in the same order that they will appear in string diagrams.
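As a minimal numerical sketch of composition in the finite case, the following uses NumPy and assumes the convention that a kernel \(f:A\rightarrow P(B)\) is stored as a |*B*|-by-|*A*| array whose columns sum to 1. The particular numbers and the helper name `compose` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Convention (an assumption of this sketch): F[b, a] is the probability that
# kernel f assigns to outcome b given input a, so every column sums to 1.
F = np.array([[0.9, 0.2],
              [0.1, 0.8]])        # f : A -> P(B), with |A| = |B| = 2
G = np.array([[0.5, 0.0],
              [0.3, 0.4],
              [0.2, 0.6]])        # g : B -> P(C), with |C| = 3


def compose(F, G):
    """Matrix of 'first f, then g': entry [c, a] = sum_b F[b, a] * G[c, b]."""
    return G @ F


FG = compose(F, G)

# The same composite computed entry by entry from the defining sum over b.
FG_explicit = np.zeros((G.shape[0], F.shape[1]))
for a in range(F.shape[1]):
    for c in range(G.shape[0]):
        FG_explicit[c, a] = sum(F[b, a] * G[c, b] for b in range(F.shape[0]))
```

Note that the matrix product is written `G @ F` even though the kernels are applied in the order *f* then *g*, which is the ordering mismatch the diagram-order notation is meant to avoid.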

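It is straightforward to show that composition is associative, that is, for a further kernel \(h:C\rightarrow P(D)\),

\[(f\fatsemi g)\fatsemi h = f\fatsemi (g\fatsemi h). \tag{9}\]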

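In addition, for every finite set *A* there is an identity kernel, which amounts to just the |*A*|-by-|*A*| identity matrix. We write this as \(\mathsf {id}_A\) and define it by \(\mathsf {id}_A(a'\,\|\,a) = 1\) if \(a'=a\) and 0 otherwise. For every Markov kernel \(f:A\rightarrow P(B)\) we have

\[\mathsf {id}_A\fatsemi f = f = f\fatsemi \mathsf {id}_B. \tag{10}\]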

These two facts mean that there is a category whose objects are finite sets and whose morphisms are Markov kernels between finite sets. This category is called \(\mathsf {FinStoch}\).

Since Markov kernels are morphisms in a category, we will often write \(f:A\dashrightarrow B\) instead of \(f:A\rightarrow P(B)\), using the dotted arrow to distinguish morphisms in \(\mathsf {FinStoch}\) and related categories from ordinary functions. (In the main text we continue writing them as functions in order to avoid introducing new notation.)

The composition of Markov kernels can be generalised to the case of measure-theoretic probability, which allows us to reason about continuous probability and more general probability measures using the same kinds of diagram and much of the same reasoning. We briefly discuss this in more detail in Appendix A.2. The main difference is that composition becomes integration over measures rather than summation.

Probability measures themselves may be seen as a special case of Markov kernels. Consider a set with a single element, denoted \(\mathbbm {1}= \{\star \}\). (The identity of the element does not matter because all one-element sets are isomorphic to each other. Category theorists often speak of “the one-element set” for this reason. We use a star to denote the element.) Then a Markov kernel \(p:\mathbbm {1}\dashrightarrow A\) is a function \(p:\mathbbm {1}\rightarrow P(A)\), which takes an element of \(\mathbbm {1}\) and returns a probability measure over *A*. Since there is only one element of \(\mathbbm {1}\) this means that the kernel *p* only defines a single probability measure over *A*. We therefore think of Markov kernels of type \(\mathbbm {1}\dashrightarrow A\) and probability measures over *A* as essentially the same thing.

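We now begin to introduce the string diagram notation. A Markov kernel \(f:A\dashrightarrow B\) will be denoted by a box labelled *f*, with an input wire labelled *A* entering from the left and an output wire labelled *B* leaving to the right (Eq. (11); diagram not reproduced here).

This expression means much the same thing as the notation \(f:A\dashrightarrow B\). It is just a formal symbol denoting the kernel *f*, annotated with type information.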

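The composition of kernels \(f:A\dashrightarrow B\) and \(g:B\dashrightarrow C\) is written by joining the output wire of the box *f* to the input wire of the box *g* (Eq. (12); diagram not reproduced here).

The left and right hand side of this equation are just two different ways to write the composite kernel \(f\fatsemi g\), as defined by Eq. (8) or its measure-theoretic generalisation.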

In reading a diagram like the right-hand side of Eq. (12) we find it helpful to imagine an element of *A* travelling along the wire from the left. As it passes through the kernel *f* it is stochastically transformed into an element of *B*, in a way that might depend on its original value. It then travels further to the right and is stochastically transformed by *g* into an element of *C*. Equation (8) can be seen as describing this process.

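In string diagrams a special notation is used for identity kernels (or identity morphisms more generally): an identity kernel \(\mathsf {id}_A\) is drawn simply as a wire labelled *A* with no box on it (Eq. (13); diagram not reproduced here).

For any Markov kernel \(f:A\dashrightarrow B\) the identity law Eq. (10) can then be written as a diagram equation stating that a bare wire placed before or after the box *f* leaves it unchanged (Eq. (14); diagram not reproduced here).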

This allows us to think of the wires as stretchy: we can extend and contract them at will. We will think of the wires as continuously deformable, rather than extending and contracting in discrete units. This is justified by the formal theory of string diagrams. (One may informally think of the wire itself as an infinite chain of identity kernels, all composed together.) This ability to continuously deform diagrams turns out to be an extremely powerful and useful idea.

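Another special notation is used for one-element sets^{Footnote 3}: they are drawn as no wire at all. For this reason a probability measure over *A*, that is, a kernel \(p:\mathbbm {1}\dashrightarrow A\), is drawn as a box labelled *p* with no input wire and a single output wire labelled *A* (Eq. (15); diagram not reproduced here).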

(Morphisms of this kind are sometimes known as “states,” and they are often drawn as a triangle rather than a box, though here we draw them in the same style as other morphisms.)

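It is worth noting that the kernels *p* and *f* above can be composed, yielding a kernel of type \(\mathbbm {1}\dashrightarrow B\), that is, a probability measure over *B* (Eq. (16); diagram not reproduced here).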

Because of this, although the kernel \(f:A\dashrightarrow B\) is defined as a function \(f:A\rightarrow P(B)\) mapping *elements* of *A* to probability distributions over *B*, we can instead choose to see it as mapping *probability measures* over *A* to probability measures over *B*. In the finite case, if we think of finite probability distributions as normalised and nonnegative vectors in \(\mathbb {R}^{n}\), then *f* can be seen as a linear map with the property that it maps points in one simplex to points in another. (This justifies thinking of it as a stochastic matrix.)
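The view of a kernel as a linear map between simplices can be checked numerically. The sketch below again assumes the column-stochastic storage convention (entry `[b, a]` is the probability of *b* given *a*) and uses made-up numbers.

```python
import numpy as np

F = np.array([[0.9, 0.2],
              [0.1, 0.8]])       # F[b, a]: probability of b given a (assumed layout)
p = np.array([0.25, 0.75])       # a probability distribution over A
p2 = np.array([1.0, 0.0])        # another one (a point mass on the first element)

pushed = F @ p                   # resulting distribution over B: sum_a p[a] * F[b, a]

# Linearity: pushing a mixture forward equals mixing the pushed-forward distributions.
w = 0.3
mix_then_push = F @ (w * p + (1 - w) * p2)
push_then_mix = w * (F @ p) + (1 - w) * (F @ p2)
```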

The string diagram notation becomes useful when we start thinking about joint distributions. We do this by drawing wires in parallel. As an example, we can consider a Markov kernel defined by a function \(h:A\times B \rightarrow P(C\times D)\). This function takes two arguments, an element of *A* and an element of *B*, and it returns a joint probability distribution over *C* and *D*. In string diagrams we write this as a box labelled *h* with two parallel input wires, *A* and *B*, and two parallel output wires, *C* and *D* (Eq. (17); diagram not reproduced here).

In symbols, we write \(h:A\otimes B\dashrightarrow C\otimes D\). An object like \(A\otimes B\), drawn as two parallel wires, can either be thought of as the measurable space \(A\times B\) (which is the Cartesian product of sets in the finite case), or as the space of probability measures over \(A\times B\). The symbol \(\otimes \) is referred to as a monoidal product.

There is some inherent ambiguity in this notation. If we draw three parallel wires, it could either mean \((A\otimes B)\otimes C\) or \(A\otimes (B\otimes C)\). In the finite case, these correspond to the sets \((A\times B)\times C\) and \(A\times (B\times C)\). These are different sets, since one is composed of pairs ((*a*, *b*), *c*) and the other of pairs (*a*, (*b*, *c*)). This ambiguity is not important in practice, however, and the formal machinery of *monoidal categories* allows us to use string diagrams without worrying about it. We do not give a formal treatment of this here. (A concise summary can be found in [4].) Instead we simply remark that when we draw three parallel wires we think of joint distributions over *A*, *B* and *C*, and the precise distinction between \(P(A\times (B\times C))\) and \(P((A\times B)\times C)\) will not be important to us.

In a similar vein, the spaces *A* and \(A\otimes \mathbbm {1}\) are different, but the difference is not important to us, and in fact they are written the same way in string diagrams. This is because we draw \(\mathbbm {1}\) as an invisible wire. This also allows us to write

That is, string diagrams are stretchy in the vertical direction as well as the horizontal one. We can bend the wires, as long as we don’t deform them so much that they point backwards, from right to left.

This also allows us to write things like

for a kernel .

We can also draw morphisms (i.e. Markov kernels) in parallel with each other, for example, a box \(f:A\dashrightarrow B\) drawn in parallel with a box \(g:C\dashrightarrow D\) (Eq. (20); diagram not reproduced here).

We write this in symbols as \(f\otimes g\), which is a morphism of type \(A\otimes C\dashrightarrow B\otimes D\). In the finite case, it is given by

\[(f\otimes g)(b,d\,\|\,a,c) = f(b\,\|\,a)\,g(d\,\|\,c). \tag{21}\]

The probabilities \(f(b\,\|\,a)\) and \(g(d\,\|\,c)\) are multiplied together because the two Markov kernels are operating in parallel. One can imagine an element of *A* entering from the bottom left and being stochastically transformed by *f* into an element of *B*, while in parallel, and independently, an element of *C* enters from the top left and is stochastically transformed by *g* into an element of *D*. In general, in the finite case, \(f\otimes g\) is given by the tensor product of the stochastic matrices that represent *f* and *g*. (This might give some intuition for the symbol \(\otimes \).)
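The tensor-product description can be illustrated with NumPy's Kronecker product. This is only a sketch under assumed conventions: kernels are stored column-stochastically, and a pair such as (*b*, *d*) is flattened to the single index `b * |D| + d`. The helper `entry` is hypothetical and exists only to undo that flattening.

```python
import numpy as np

F = np.array([[0.9, 0.2],
              [0.1, 0.8]])        # F[b, a]: kernel from A to B, shape (|B|, |A|) = (2, 2)
G = np.array([[0.5, 0.0],
              [0.3, 0.4],
              [0.2, 0.6]])        # G[d, c]: kernel from C to D, shape (|D|, |C|) = (3, 2)

# Parallel composition: a (|B||D|)-by-(|A||C|) column-stochastic matrix.
FxG = np.kron(F, G)


def entry(b, d, a, c):
    """Look up the probability of the pair (b, d) given the pair (a, c)."""
    n_d, n_c = G.shape
    return FxG[b * n_d + d, a * n_c + c]
```

For instance, `entry(1, 2, 0, 1)` equals `F[1, 0] * G[2, 1]`, the product of the two independent transition probabilities.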

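We can cross wires over each other. (In category theory terms, the categories we are concerned with are symmetric monoidal categories.) The diagram in which a wire *A* and a wire *B* cross over each other (Eq. (22); diagram not reproduced here) can be seen as a Markov kernel of type \(A\otimes B\dashrightarrow B\otimes A\). In the finite case it is defined by sending the pair (*a*, *b*) to the pair (*b*, *a*) with probability 1.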

We have a number of equations that are standard in monoidal category theory, and allow us to freely slide boxes along wires and bend wires to cross over each other. These can either be shown directly from the definitions above or (perhaps more usefully) deduced from the definition of a symmetric monoidal category. Three such equations are as follows. More details can be found in the references cited above.

So far, everything we have said about string diagrams applies to any symmetric monoidal category. However, there are two additional things we can add that take us much closer to probability theory. These are the ability to *copy* and to *delete*. These operations, and their special properties, do not necessarily exist in other contexts, such as quantum mechanics. This is a central point of [4, 14]. We will stick to the context of classical probability, however, so copying and deletion will always be possible in this paper.

We cover deletion first. For every measurable space *A* there is a unique kernel of type \(A\rightarrow \mathbbm {1}\), which we call \(\mathsf {del}_A\). In the finite case it is given by \(\mathsf {del}_A(\star \,\|\,a) = 1\) for all \(a\in A\). We can think of this as a \(1\times |A|\) matrix (i.e. a row vector) whose entries are all 1. This is the only possible \(1\times |A|\) stochastic matrix.

In string diagrams we write such a deletion kernel as a black dot:

There is one such morphism for every measurable space, but we denote them all with the same kind of black dot. These black dots have the property that

\[f\fatsemi \mathsf {del}_B = \mathsf {del}_A \tag{28}\]

for every Markov kernel \(f:A\dashrightarrow B\). This says that if we take some input *A*, perform some stochastic operation *f* on it and then delete the result, this is the same as simply deleting the input.^{Footnote 4}
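In the finite case this property is just the statement that every column of a stochastic matrix sums to 1. A minimal check, assuming the column-stochastic storage convention and a hypothetical helper `delete`:

```python
import numpy as np


def delete(n):
    """The unique 1-by-n stochastic matrix: a row vector of ones."""
    return np.ones((1, n))


G = np.array([[0.5, 0.0],
              [0.3, 0.4],
              [0.2, 0.6]])        # G[b, a]: a kernel from a 2-element set to a 3-element set

# 'First apply the kernel, then delete' as a matrix product.
kernel_then_delete = delete(3) @ G
```

Here `kernel_then_delete` coincides with `delete(2)`, the deletion kernel on the input set.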

The second special operation is copying. For every measurable space *A* there is a kernel \(\mathsf {copy}_A:A\rightarrow A\otimes A\), which we will describe shortly. We write this also as a black dot, but this time with two output wires rather than one.

Informally, this kernel takes an outcome \(a\in A\) and copies it, producing a pair (*a*, *a*) of identical values. It’s important to note that it copies *values* rather than *distributions*. Its output does not consist of two independent and identically distributed elements of *A* but rather two perfectly correlated elements of *A* that always have the same value. In the discrete case the copy map is defined as

\[\mathsf {copy}_A(a',a''\,\|\,a) = {\left\{ \begin{array}{ll} 1 &{} \text {if } a'=a''=a,\\ 0 &{} \text {otherwise.} \end{array}\right. } \tag{30}\]

In addition to Eq. (28), the copy and delete maps obey the following properties [22, Definition 2.1]:

Equation (31) says that if we make multiple copies of something it doesn’t matter which order we make them in. Equation (32) says that if we copy something and then delete one of the copies, that is the same as doing nothing to it. Equation (33) says that if we copy something and then swap the copies it makes no difference. (Because the two copies are the same as each other.)

Equations (34) and (35) are more technical. They say that if we have elements of *A* and *B* we can delete or copy them as a single element of \(A\otimes B\) or separately, as elements of *A* and *B*, and these should give the same result.
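For readers without access to the diagrams, the five properties can be transcribed into symbols roughly as follows. This is a sketch based on the verbal descriptions above and on the standard definition of a Markov category. It assumes that \(\fatsemi\) denotes composition in diagram order and that \(\mathsf{swap}_{A,B}\) (a name introduced here for illustration) denotes the wire-crossing kernel of type \(A\otimes B\rightarrow B\otimes A\). The coherence isomorphisms identifying \(A\otimes \mathbbm{1}\) with *A* are suppressed.

```latex
\begin{align}
\mathsf{copy}_A \fatsemi (\mathsf{copy}_A \otimes \mathsf{id}_A)
  &= \mathsf{copy}_A \fatsemi (\mathsf{id}_A \otimes \mathsf{copy}_A) \tag{31}\\
\mathsf{copy}_A \fatsemi (\mathsf{id}_A \otimes \mathsf{del}_A)
  &= \mathsf{id}_A
   = \mathsf{copy}_A \fatsemi (\mathsf{del}_A \otimes \mathsf{id}_A) \tag{32}\\
\mathsf{copy}_A \fatsemi \mathsf{swap}_{A,A}
  &= \mathsf{copy}_A \tag{33}\\
\mathsf{del}_{A\otimes B}
  &= \mathsf{del}_A \otimes \mathsf{del}_B \tag{34}\\
\mathsf{copy}_{A\otimes B}
  &= (\mathsf{copy}_A \otimes \mathsf{copy}_B) \fatsemi
     (\mathsf{id}_A \otimes \mathsf{swap}_{A,B} \otimes \mathsf{id}_B) \tag{35}
\end{align}
```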

These equations can be derived from the definitions we have given for the finite case. They may also be derived in various more general measure-theoretic contexts [13, 22].

However, the approach of [22] is instead to treat them as *axioms*: any symmetric monoidal category with copy and delete maps that obey Eqs. (28) and (31) to (35) is called a *Markov category*. One can do a surprising amount of reasoning about probability theory using these axioms alone, although there are also Markov categories that do not directly resemble the category of measurable spaces and Markov kernels that we have described. There are various additional axioms that can be added as well, which then allow more specific results to be proven. (See [22] for the details.)

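An important thing to note about the copy operator is that, in general, for a kernel \(f:A\dashrightarrow B\),

\[f\fatsemi \mathsf {copy}_B \ne \mathsf {copy}_A\fatsemi (f\otimes f). \tag{36}\]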

That is, copying the output of a kernel *f* is not the same as copying its input and then applying two copies of the kernel to it. Intuitively, this is because *f* might be stochastic. If we copy the output we end up with two perfectly correlated copies, whereas if we copy the input then the stochastic variations will be independent.

However, if the kernel is deterministic then copying its input is indeed the same as copying its output. In fact, in the Markov category framework this is the *definition* of a deterministic Markov kernel: we say a kernel \(h:A \rightarrow B\) is deterministic if

\[h\fatsemi \mathsf {copy}_B = \mathsf {copy}_A\fatsemi (h\otimes h). \tag{37}\]

In this paper we use square boxes for kernels that are known to be deterministic, and boxes with rounded edges for general, possibly-stochastic kernels.
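The difference between copying the output and copying the input of a kernel can be seen numerically in the finite case. The sketch below assumes column-stochastic matrices and flattens a pair (*x*, *y*) over an *n*-element set to the index `x * n + y`. The helper `copy` and the example matrices are illustrative assumptions.

```python
import numpy as np


def copy(n):
    """Copy kernel on an n-element set: sends a to the pair (a, a) with probability 1."""
    C = np.zeros((n * n, n))
    for a in range(n):
        C[a * n + a, a] = 1.0
    return C


F = np.array([[0.9, 0.2],
              [0.1, 0.8]])        # a genuinely stochastic kernel (entry [b, a])
D = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # a deterministic kernel: it swaps the two elements

# Copying the output gives two perfectly correlated values.
copy_output_F = copy(2) @ F
# Copying the input and applying the kernel twice gives two independent values.
copy_input_F = np.kron(F, F) @ copy(2)

copy_output_D = copy(2) @ D
copy_input_D = np.kron(D, D) @ copy(2)
```

For the stochastic kernel `F` the two results differ (for the first input, `[0.9, 0, 0, 0.1]` against `[0.81, 0.09, 0.09, 0.01]`), whereas for the deterministic kernel `D` they agree.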

In the main text, we write Markov kernels as functions \(f:A\rightarrow P(B)\), and we write deterministic kernels as functions \(f:A\rightarrow B\). To be more precise, a deterministic kernel should really also be considered as a function \(f:A\rightarrow P(B)\), such that Eq. (37) is obeyed. However, if we assume we are working in a category called \(\mathsf {BorelStoch}\) (which is a common assumption in category-theoretic probability) then Eq. (37) implies that *f* always returns a delta measure [22, Example 10.5], and in this case there is not much harm in treating a deterministic kernel *f* as a function \(f:A\rightarrow B\).

### A.2 The Extension to Measure Theory

Above we described the category \(\mathsf {FinStoch}\) and introduced string diagrams mostly in that context. Here we briefly describe how this generalises to the measure-theoretic case, which is needed in order to think about continuous probability.

In the measure-theoretic case the objects (*X*, *Y*, etc.) are any measurable spaces rather than only finite sets. Markov kernels are still functions \(f:X\rightarrow P(Y)\), but now *P*(*Y*) is the set of all probability measures on the measurable space *Y*. (That is, *P*(*Y*) is the set of all functions from the \(\sigma \)-algebra associated with *Y* to [0, 1], such that Kolmogorov’s axioms are obeyed.) *P*(*Y*) can itself be made into a measurable space in a standard way, and the function *f* must obey an additional restriction that it be a measurable function. (This means that the preimage of every measurable subset of *P*(*Y*) must be a member of the \(\sigma \)-algebra associated with *X*.)

In this case *f*(*x*) is a probability measure rather than a probability distribution, and composition is given by integration rather than summation. (See [22, Example 4] for the details.) This gives rise to a category called \(\mathsf {Stoch}\), whose objects are all measurable spaces and whose morphisms are all Markov kernels. (This category is also known as the Kleisli category of the Giry monad, for reasons we do not discuss here.)
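For orientation, the integral form of composition is usually written as below. This is the standard formula rather than a quotation from the paper. The symbols are assumptions for illustration: kernels \(f:X\rightarrow P(Y)\) and \(g:Y\rightarrow P(Z)\), a measurable set \(S\subseteq Z\), and \(f(S\,\|\,x)\) as shorthand for the measure \(f(x)\) evaluated on \(S\).

```latex
\[
(f \fatsemi g)(S \,\|\, x) \;=\; \int_{Y} g(S \,\|\, y)\, f(\mathrm{d}y \,\|\, x)
\]
```

When *Y* is a finite set the integral reduces to the sum over \(y\in Y\) used in the finite case.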

Unfortunately the category \(\mathsf {Stoch}\) does not have all of the properties that one might want it to have. (See Appendix A.3 below.) Because of this a common approach is to work in a category called \(\mathsf {BorelStoch}\) (also discussed in [22, Example 4]), in which the objects are a subset of measurable spaces called standard Borel spaces, and the morphisms are all Markov kernels between standard Borel spaces. Standard Borel spaces include many kinds of measurable space that one would be likely to use in practice, and in particular they include both finite sets and \(\mathbb {R}^{n}\) with its usual \(\sigma \)-algebra.

In the present paper, the properties of \(\mathsf {BorelStoch}\) are used in two ways. Firstly, in \(\mathsf {BorelStoch}\) we can always use conditionals, as explained in the next section. Secondly, as a notational convenience we treat deterministic kernels and measurable functions as interchangeable, which makes sense in \(\mathsf {BorelStoch}\) but doesn’t hold in the more general case of \(\mathsf {Stoch}\).

### A.3 Conditionals and Bayes’ Theorem

Conditional probabilities and Bayes’ theorem play central roles in the theory of inference. Here we briefly discuss how they look in string diagrams. Given a joint distribution \(q:\mathbbm {1}\dashrightarrow A\otimes B\) we may want to split it up into a product of a marginal and a conditional, which in traditional notation, in the discrete case, would be written \(p(a,b) = p(a)\, p(b\,|\, a)\).

The category-theoretic approach, as set out in [13, 22], is slightly different. We write the following, which is called a *disintegration* of *q*. (The term “disintegration” is used because it is the opposite of integration.)

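Here, the kernel of type \(\mathbbm {1}\dashrightarrow A\) is the marginal of *A* according to the joint distribution *q*. In the finite case it can be written \(\sum _{b\in B} q(a,b)\). The kernel \(c:A\dashrightarrow B\) is called a *conditional* of *q*. It is defined by Eq. (38), which in the finite case can be written

\[q(a,b) = \left( \sum _{b'\in B} q(a,b')\right) c(b\,\|\,a). \tag{39}\]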

This is closely analogous to the identity \(p(a,b) = p(a)\, p(b\,|\, a)\). The difference is that \(p(b\,|\,a)\) is defined as *p*(*a*, *b*)/*p*(*a*), and is only defined when \(p(a)>0\). On the other hand, in Eq. (39), if \(\left( \sum _{b'\in B} q(a,b')\right) = 0\) for some \(a\in A\) then *q*(*a*, *b*) must be 0 for all \(b\in B\), and consequently the equation puts no constraint on \(c(b\,\|\,a)\) in this case.

This means that instead of being undefined in this case, the conditional *c* is not *uniquely* defined: there may be many different kernels *c* that satisfy the equation.

This carries over to the general measure-theoretic case as well. If we are in the category \(\mathsf {BorelStoch}\) then for any joint distribution there exists at least one conditional that satisfies Eq. (38), but there might be many. (In the case of \(\mathsf {Stoch}\) conditionals may fail to exist at all, see [22, Example 11.3].)
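The finite case can be made concrete with a small numerical sketch. The code below is illustrative only: the array layout `q[a, b]`, the function name `disintegrate` and the `fill` argument are assumptions introduced here, not part of the paper. It computes a marginal and a conditional for a finite joint distribution, and shows that two different conditionals both satisfy the defining equation when some value of *a* has marginal probability zero.

```python
import numpy as np


def disintegrate(q, fill=None):
    """Split a finite joint q[a, b] into a marginal on A and a conditional c[a, b].

    Each row of c is a distribution over B. Where the marginal is zero the
    defining equation puts no constraint on the row, so `fill` (any
    distribution over B, uniform by default) is used there.
    """
    q = np.asarray(q, dtype=float)
    marginal = q.sum(axis=1)
    if fill is None:
        fill = np.full(q.shape[1], 1.0 / q.shape[1])
    c = np.tile(np.asarray(fill, dtype=float), (q.shape[0], 1))
    positive = marginal > 0
    c[positive] = q[positive] / marginal[positive][:, None]
    return marginal, c


q = np.array([[0.1, 0.3],
              [0.0, 0.0],   # this value of a has marginal probability zero
              [0.6, 0.0]])

marginal, c_uniform = disintegrate(q)
_, c_other = disintegrate(q, fill=[1.0, 0.0])

# Both conditionals reproduce the joint, although they differ on the middle row.
reproduces_uniform = np.allclose(marginal[:, None] * c_uniform, q)
reproduces_other = np.allclose(marginal[:, None] * c_other, q)
conditionals_differ = not np.allclose(c_uniform, c_other)
```

The choice of a uniform default for unconstrained rows is arbitrary; any other distribution over *B* would serve equally well.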

We may also want to disintegrate a joint distribution that is a function of some parameter \(z\in Z\). In this case Eq. (38) becomes

Conceptually this is very similar. We want the disintegration to hold for every parameter value \(z\in Z\), and we define the conditional to be a function of *z* as well as of \(a\in A\). In the discrete case, Eq. (40) is analogous to the identity \(p(a,b\,|\, z) = p(a\,|\, z)\,p(b\,|\, a,z)\).

Bayes’ theorem is closely related to conditional probability and can be expressed in a similar way. Given a prior *q* and a kernel *f*, we can define a *Bayesian inverse* of *f* with respect to *q*, which is a kernel \(f^{\dagger }\) such that

The Bayesian inverse \(f^{\dagger }\) depends on the prior *q* as well as on the kernel *f*. If we had chosen a different distribution in place of *q*, the Bayesian inverse \(f^{\dagger }\) would be different. As with conditionals, Bayesian inverses are not necessarily unique, and for a given *f* and *q* there may be many kernels \(f^{\dagger }\) that satisfy Eq. (41). (In fact, Bayesian inverses can be seen as a special case of conditionals; see [13, 22].)
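A minimal finite sketch of a Bayesian inverse follows. The array conventions (`prior[a]`, `f[a, b]`, `f_dagger[b, a]`) and the function name are assumptions made for illustration. The example checks the finite analogue of the defining equation, namely that the prior times the kernel equals the evidence times the inverse, and shows that changing the prior changes the inverse.

```python
import numpy as np


def bayesian_inverse(prior, f):
    """Return (f_dagger, evidence) for a finite prior[a] and kernel f[a, b].

    f_dagger[b, a] is a posterior over a given b. Rows belonging to outputs b
    with zero evidence are unconstrained and are set to uniform here.
    """
    prior = np.asarray(prior, dtype=float)
    f = np.asarray(f, dtype=float)
    joint = prior[:, None] * f            # joint[a, b]
    evidence = joint.sum(axis=0)          # evidence[b]
    n_a = len(prior)
    f_dagger = np.full((f.shape[1], n_a), 1.0 / n_a)
    positive = evidence > 0
    f_dagger[positive] = (joint[:, positive] / evidence[positive]).T
    return f_dagger, evidence


f = np.array([[0.9, 0.1],
              [0.2, 0.8]])

inv_flat, ev_flat = bayesian_inverse([0.5, 0.5], f)
inv_skewed, ev_skewed = bayesian_inverse([0.9, 0.1], f)

# Finite form of the defining equation: prior[a] f[a, b] == evidence[b] f_dagger[b, a]
holds_flat = np.allclose(np.array([0.5, 0.5])[:, None] * f,
                         (ev_flat[:, None] * inv_flat).T)
holds_skewed = np.allclose(np.array([0.9, 0.1])[:, None] * f,
                           (ev_skewed[:, None] * inv_skewed).T)
depends_on_prior = not np.allclose(inv_flat, inv_skewed)
```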

We may also consider the case where the prior takes a parameter. In this case a Bayesian inverse also needs to depend on the parameter in general, which gives us the following more general definition:

The references [13, 22, 37] contain much more detail about Bayes’ theorem in this form.

### B More Details About Bayesian Interpretations

### 1.1 B.1 Unpacking Bayesian Filtering Interpretations

In this section we give some more intuition for Definition 2 and then note some consequences of it. The section deals mostly with the case where *S*, *Y* and *H* are discrete sets, meaning that we can reason in terms of probability distributions rather than measure theory. In this case Definition 2 can be written in a form that makes the relationship to Bayes’ theorem more clear. We define a notion of *subjectively impossible input*, which is a value of *S* that the reasoner believes with certainty will not occur as its next input. (This does not imply that the input actually is impossible according to the true dynamics of the environment.) We show that Definition 2 puts no constraints on the reasoner’s posterior after receiving a subjectively impossible input.

We also show that the possible interpretations of a machine only depend on which states can transition to which other states given which inputs, and not on the probabilities of such transitions. In addition, we show that some machines admit no non-trivial interpretations at all.

In order to unpack Definition 2 a little more, let us consider the case where *S*, *Y* and *H* are discrete. Before starting we note that in the finite case, the definition of \(\psi _S\), Eq. (4), can be written as

In this case, Eq. (5) can be written in symbols as

for all \(s\in S, h\in H, y,y'\in Y\). We can cancel the machine’s transition probability from both sides on the assumption that it is positive, yielding

The condition that the probability of transitioning from *y* to \(y'\) on input *s* is positive means that \(y'\in Y\) is a *possible next state* when the machine starts in state \(y\in Y\) and receives the input \(s\in S\). (There may be many possible next states in this situation because the machine may be stochastic.)
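The finite-case statements above can be sketched explicitly. The following is one plausible rendering rather than a transcription of the numbered equations: the notation \(f(b\,\Vert \,a)\) for the probability that a kernel *f* assigns to *b* on input *a*, and the ordering of the arguments, are assumptions made here for illustration. The first line is the consistency condition with the machine's transition probability on both sides, and the second is the form obtained after cancelling it.

```latex
% Hedged sketch: finite-case consistency condition, before and after cancellation
\begin{align*}
\psi_{S,H'}(s,h \,\Vert\, y)\,\gamma(y' \,\Vert\, s,y)
  &= \psi_S(s \,\Vert\, y)\,\gamma(y' \,\Vert\, s,y)\,\psi_H(h \,\Vert\, y'), \\
\psi_{S,H'}(s,h \,\Vert\, y)
  &= \psi_S(s \,\Vert\, y)\,\psi_H(h \,\Vert\, y')
  \quad \text{whenever } \gamma(y' \,\Vert\, s,y) > 0 .
\end{align*}
```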

Let us then suppose that the machine starts in state *y*, receives an input *s*, and transitions to state \(y'\). Let *h* be an arbitrary element of *H*. The probability that \(\psi _{S,H'}(y)\) assigns to the pair \((s,h)\) can then be seen as the reasoner’s prior probability that the next hidden state is *h* and the next input is *s*. In more traditional notation we might write this as \(P(H'=h, S=s)\), where we leave the state of the underlying machine implicit. (Here we do not attempt to formalise this in terms of random variables, but simply treat it as a kind of notational shorthand for that probability.)

We may then regard the probability that \(\psi _S(y)\) assigns to *s* as the reasoner’s prior probability that the next input is *s*, i.e. \(P(S=s) = \sum _{h\in H} P(H'=h,S=s)\).

However, since the probability that \(\psi _H(y')\) assigns to *h* is conditioned on \(y'\) rather than *y*, we instead regard it as the reasoner’s *posterior* probability that \(H'=h\). (We refer to \(H'\) rather than *H* here because after it receives an input its previous “next” hidden state becomes its current hidden state.) This probability therefore corresponds to what we might write as \(P(H'=h \mid S=s)\).

With this informal shorthand notation Eq. (45) then says

$$P(H'=h,\, S=s) = P(S=s)\, P(H'=h \mid S=s), \tag{46}$$

which has the same appearance as a familiar identity from elementary probability theory. It corresponds to a single step of Bayesian filtering, which we spell out in more detail in Appendix B.2.

This shorthand notation gives some intuition for why Eq. (5) has the particular form it does, but it leaves the dependence on the state of the underlying machine implicit, and in so doing it obscures an important and subtle point. In a more traditional context, \(P(H'=h \mid S=s)\) is defined by

$$P(H'=h \mid S=s) = \frac{P(H'=h,\, S=s)}{P(S=s)} \tag{47}$$

and has no value when \(P(S=s)=0\). However, in our case \(P(H'=h \mid S=s)\) is a shorthand for the probability that \(\psi _H(y')\) assigns to *h*, which is defined even when \(\psi _S(y)\) assigns probability zero to *s*.

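In the case where \(P(S=s)>0\), Eq. (5) in the form of Eq. (46) demands that \(P(H'=h \mid S=s)\) is indeed equal to \(P(H'=h, S=s)/P(S=s)\). More precisely, if \(\psi _S(y)\) assigns positive probability to *s*, then for every possible next state \(y'\) the probability that \(\psi _H(y')\) assigns to *h* must equal the probability that \(\psi _{S,H'}(y)\) assigns to \((s,h)\), divided by the probability that \(\psi _S(y)\) assigns to *s*. However, if \(\psi _S(y)\) assigns probability zero to *s* then Eq. (5) puts no constraints on \(\psi _H(y')\) at all, or indeed on the machine’s transition probabilities for that input.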

In the case where *S* is a discrete set (even if *Y* and *H* are not discrete), we say that \(s\in S\) is a *subjectively impossible input* for a given state \(y\in Y\) if \(\psi _S(y)\) assigns probability zero to *s*. The point is that the reasoner believes, with certainty, that it will not receive the input *s* as its next input. The reasoning above says that in this situation, *any* posterior over *H* is acceptable, because Bayes’ rule doesn’t specify what the posterior should be. We find this somewhat analogous to the fact that in logic one can deduce any proposition from a contradiction. Definition 2 indeed permits any posterior in the case of a subjectively impossible input. In fact, it even allows the posterior to be chosen stochastically in this case.

This is in a sense the minimal possible assumption we could make. However, one could imagine addressing the issue in a different way by changing the framework, thus introducing a subtly different notion of interpretation than the one we have presented here. One possibility would be to allow *partial interpretations*, where \(\psi _H\) becomes a partial function, meaning that not every state of the machine needs to have an interpretation at all. This would allow the posterior to be undefined in the case of a subjectively impossible input, rather than merely arbitrarily defined. Another possibility would be to strengthen Eq. (5) with additional conditions, forcing the posterior to be meaningful even after a subjectively impossible input. We suspect that such an approach can lead to an interesting way to formalise improper priors, which are also about having meaningful posteriors in the case of ‘impossible’ inputs, but we leave investigation of this to future work.

We note one other important consequence of the above reasoning, in the discrete case. When we express Eq. (5) in the form of Eq. (45), we see that it only depends on whether a transition from *y* to \(y'\) is possible given an input *s*, and not on the probability of such a transition. Thus, for Bayesian filtering interpretations (and hence also for Bayesian inference interpretations), the only property of a machine that matters is which states can be reached from which other states (in a single step) under a given input. (Strictly speaking this only makes sense in the discrete case, but we expect an analogous statement to this to hold more generally.)
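The dependence on the support of the transition probabilities alone can be checked numerically in the finite case. The sketch below rests on assumptions: the array layouts `gamma[s, y, y2]`, `psi_H[y, h]` and `kappa[h, h2, s]`, the function name, and the reading of the cancelled consistency condition as "joint belief equals input belief times posterior, for every possible transition". The three-state machine is a hypothetical example in the spirit of Appendix C.1, not a transcription of it.

```python
import numpy as np


def is_consistent_filtering(gamma, psi_H, kappa, tol=1e-9):
    """Check the finite, cancelled consistency condition (assumed form).

    gamma[s, y, y2]: probability of moving from y to y2 on input s.
    psi_H[y, h]:     interpretation map, a belief over hidden states.
    kappa[h, h2, s]: model, joint probability of next hidden state and input.
    """
    joint = np.einsum("yh,hks->ysk", psi_H, kappa)   # belief over (s, next h) given y
    psi_S = joint.sum(axis=2)                        # belief over next input given y
    for s, y, y2 in zip(*np.nonzero(gamma > 0)):     # only possible transitions matter
        if not np.allclose(joint[y, s], psi_S[y, s] * psi_H[y2], atol=tol):
            return False
    return True


# Hidden state never changes and the input reveals it exactly.
kappa = np.zeros((2, 2, 2))
kappa[0, 0, 0] = 1.0
kappa[1, 1, 1] = 1.0

# State 0 is uncertain; states 1 and 2 are certain of hidden state 0 and 1.
psi_H = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])

gamma = np.zeros((2, 3, 3))
gamma[0, 0, 1] = 1.0
gamma[0, 1, 1] = 1.0
gamma[0, 2, :] = 1.0 / 3.0     # subjectively impossible input: arbitrary behaviour
gamma[1, 0, 2] = 1.0
gamma[1, 1, :] = 1.0 / 3.0     # subjectively impossible input: arbitrary behaviour
gamma[1, 2, 2] = 1.0

# Same support, different probabilities on a subjectively impossible input.
reweighted = gamma.copy()
reweighted[0, 2, :] = [0.7, 0.2, 0.1]

# A transition that contradicts Bayes' rule on a possible input.
broken = gamma.copy()
broken[0, 0, :] = [0.0, 0.0, 1.0]

original_ok = is_consistent_filtering(gamma, psi_H, kappa)
reweighted_ok = is_consistent_filtering(reweighted, psi_H, kappa)
broken_ok = is_consistent_filtering(broken, psi_H, kappa)
```

Under these assumptions the first two machines pass the check and the third does not, which matches the observation that only the set of possible transitions matters.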

This has the consequence that some machines only admit trivial interpretations. By a trivial interpretation we mean one where the posterior is always equal to the prior. Such interpretations exist for every machine, because we can always take the model to be such that *H* does not change over time and *S* does not depend on *H*, so that the input \(s\in S\) never gives any information about *H*. That is, in string diagrams, for any machine we can set

and

for any choice of distributions *q* and *r*. Then the conditions of Definition 2 will always be satisfied. This may be shown using string diagram manipulations and the Markov category axioms.

We now show that there is a class of machines that only admit trivial interpretations. Consider a machine with the property that the probability of transitioning from *y* to \(y'\) on input *s* is positive for all \(y,y',s\). That is, for a given input *s* and initial state *y*, every state \(y'\) has some nonzero probability of being the next state. In this case, Eq. (45) implies that

for all \(y,y'\in Y, s\in S, h\in H\). Since the left-hand side does not depend on \(y'\) it follows that \(\psi _H(y')\) must not depend on \(y'\) either. That is, \(\psi _H(y') = p\) for some fixed distribution *p*. (The other possibility would be that \(\psi _S(y)\) assigns probability zero to *s*, but this can’t hold for all \(s\in S\), because \(\psi _S(y)\) must assign nonzero probability to some *s* in order to be normalised.)

This means that if a machine satisfies this property then the only interpretations it admits are trivial ones, of the form Eq. (49). It follows that in order for a discrete machine to admit *any* non-trivial Bayesian filtering interpretation it must satisfy a fairly strong constraint, namely that some of its transition probabilities are zero.

This is to some extent a consequence of our choice to consider only exact Bayesian filtering interpretations. If a discrete machine has no zero transition probabilities it might still be possible to interpret it as performing some form of approximate inference, but defining such interpretations precisely is a task for future work.

### 1.2 B.2 More on Bayesian Filtering

In this section we show that Definition 2 does indeed correspond to Bayesian filtering, at least in the case of a deterministic machine. Our proof of this is inspired by [24, theorem 6.3], which proves an analogous fact about conjugate priors. The proof we give uses string diagram reasoning, which means that it holds even in the most general measure-theoretic context; we do not need to assume that the sets involved are discrete.

Since we restrict ourselves to only deterministic machines in this section, we will note a couple of things about deterministic machines before we talk about Bayesian filtering.

We first note that the condition for a machine \(\gamma \) to be deterministic is

This comes from the defining equation for deterministic morphisms, Eq. (37), and also the axiom (35), noting that \(\gamma \) is a kernel with input \(S\otimes Y\) and output *Y*.

Next we prove the following proposition, which is useful for reasoning about Bayesian filtering interpretations of deterministic machines.

### Proposition 1

Suppose \(\gamma \) is a deterministic machine, and let \(\psi _H\) and \(\kappa \) be arbitrary Markov kernels. Then \(\psi _H\) and \(\kappa \) form a consistent Bayesian filtering interpretation of \(\gamma \) (i.e. Definition 2 is satisfied) if and only if

with \(\psi _{S,H'}\) and \(\psi _S\) as defined in Eqs. (3) and (4).

### Proof

To see that Definition 2 implies Eq. (52) we marginalise Eq. (5):

This implies Eq. (52) by the rules for Markov categories, specifically Eqs. (28) and (32).

For the other direction we assume Eq. (52) holds and calculate

The first step substitutes in the right-hand side of Eq. (52), the second rearranges using the rules of Markov categories, and the third uses the determinism condition. This proves that Eq. (5) holds.

We now consider what a Bayesian filtering task involves. The idea is that the reasoner has a model of a hidden Markov process, given by the kernel \(\kappa \). As described in the main text, this kernel can be thought of as a process that simultaneously transforms the hidden state, stochastically, into a new value and emits a visible “sensor value.”

Given a kernel of this type, we can iterate it to produce sequences of values in *S*. For example, we can write

where \(S^{3}\) means \(S\otimes S\otimes S\) and \(\kappa ^{n}\) is notation for iterating the kernel *n* times. A kernel of this kind, thought of as an infinitely iterated process, is sometimes called a “coalgebra,” since it is a special case of a more general concept of that name. (For example, [25] takes a coalgebraic approach to de Finetti’s theorem.)

For filtering we are interested in inferring the final hidden state of a system, given a finite sequence of visible states. In order to reason about this, we define the following kernel:

This can be seen as an interpretation map, mapping the state of a reasoner to its beliefs about its next *n* inputs, \(S^{n} = (S_1, \dots , S_n)\), along with the final value of the hidden state, \(H_n\). These take the form of a joint distribution between \(S^{n}\) and \(H_n\). This joint distribution is formed from the reasoner’s initial prior over the initial hidden state \(H_1\) (given by the kernel \(\psi _H\)) and the model \(\kappa \), which is iterated *n* times.

We define this because in filtering we wish to make a probabilistic inference of the final hidden state, \(H_n\), given the sequence of visible states \(S^{n}\). To infer \(H_n\) given \(S^{n}\) we seek a disintegration of \(\psi _{S^{n},H_n}\). (See Eq. (38) in Appendix A.3.) Specifically, we seek a kernel \(\psi _{H_n\mid S^{n}}:S^{n}\otimes Y \rightarrow P(H)\) such that

The kernel \(\psi _{H_n\mid S^{n}}\) takes in a sequence \(S^{n}\) of observations and returns the reasoner’s conditional beliefs about \(H_n\), given the sequence \(S^{n}\). It is also a function of the reasoner’s initial beliefs \(y\in Y\).

In fact such a kernel can be constructed iteratively in a natural way, if we assume that \(\psi _H\) and \(\kappa \) form a consistent Bayesian filtering interpretation. To do this, we first define the iteration of \(\gamma \), in a similar way to the iteration of \(\kappa \):

where there are *n* copies of \(\gamma \) on the right-hand side. We can then state the following result, which shows that consistent Bayesian filtering interpretations can indeed be seen as performing Bayesian filtering, in the deterministic case.

### Proposition 2

The kernel obtained by applying \(\gamma ^{n}\) and then \(\psi _H\) is a conditional of \(\psi _{S^{n},H_n}\), satisfying Eq. (57), in that

### Proof

We begin by defining the kernel

We also define its iteration, \((\bar{\psi }_S)^{n}:Y\rightarrow Y\otimes S^{n} \), analogously to \(\kappa ^{n}\) and \(\gamma ^{n}\). We note that the consistency equation for Bayesian filtering interpretations, Eq. (5), can be written in terms of \(\kappa \) and \(\bar{\psi }_S\), as

We then calculate

where the last step is by applying the other steps inductively. We can then apply a second inductive argument in “the other direction” using Eq. (52), as follows:

where the last step is again by applying the other steps inductively.

We have proved that the kernel obtained by applying \(\gamma ^{n}\) and then \(\psi _H\) is a conditional of \(\psi _{S^{n},H_n}\). This kernel can be thought of as giving the reasoner’s beliefs about *H* after receiving a given sequence \(S^{n}\) of inputs, starting from a given initial state \(y\in Y\). The result shows that these beliefs are consistent with the agent’s prior \(\psi _H(y)\) and the model \(\kappa \), in the sense that the agent’s final posterior beliefs about *H* are a conditional of its initial joint beliefs about the sequence \(S^{n}\) and the final hidden state. We conclude that a deterministic machine with a consistent Bayesian filtering interpretation can indeed be seen as performing a Bayesian filtering task. We expect this to be true in the general case of stochastic machines as well.
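The content of this result can be illustrated numerically in the simplest finite setting. The sketch below assumes a particular deterministic machine that is not taken from the paper: its state is the belief vector itself, its update is one step of exact filtering, and the interpretation map is the identity. The layout `kappa[h, h2, s]`, the function names and the randomly generated model are all hypothetical. The check compares the belief reached by iterating the machine with a conditional computed directly from the joint distribution over input sequences and the final hidden state.

```python
import itertools

import numpy as np


def filter_step(belief, s, kappa):
    """One filtering update: belief over h -> belief over next h after input s."""
    joint = belief @ kappa[:, :, s]        # joint[h2] = sum_h belief[h] kappa[h, h2, s]
    total = joint.sum()
    if total == 0:
        return belief                      # subjectively impossible input: arbitrary choice
    return joint / total


def sequence_joint(belief, seq, kappa):
    """Vector over the final hidden state: probability of seq together with that state."""
    v = belief
    for s in seq:
        v = v @ kappa[:, :, s]
    return v


def max_discrepancy(prior, kappa, n):
    """Largest gap between iterated filtering and the direct conditional."""
    n_s = kappa.shape[2]
    worst = 0.0
    for seq in itertools.product(range(n_s), repeat=n):
        belief = prior
        for s in seq:
            belief = filter_step(belief, s, kappa)
        joint = sequence_joint(prior, seq, kappa)
        # Disintegration: joint == P(seq) * conditional belief
        worst = max(worst, np.abs(joint - joint.sum() * belief).max())
    return worst


rng = np.random.default_rng(0)
n_h, n_s = 3, 2
# Each kappa[h] is a joint distribution over (next hidden state, input).
kappa = rng.dirichlet(np.ones(n_h * n_s), size=n_h).reshape(n_h, n_h, n_s)
prior = np.array([0.2, 0.5, 0.3])

gap = max_discrepancy(prior, kappa, n=3)
```

With a strictly positive model no input is subjectively impossible, so the fallback branch in `filter_step` is never exercised in this example.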

### 1.3 B.3 Bayesian Inference Interpretations and Conjugate Priors

In the main text we noted that Bayesian inference corresponds to a special case of Bayesian filtering. By “Bayesian inference” here we mean the case where the reasoner is interpreted as assuming its inputs are i.i.d. samples from a distribution of known form with an unknown parameter taking values in a space *H*, which we also call the hypothesis space.

The difference between inference and filtering is that we interpret the reasoner as believing that the value of *H* is unknown but fixed. That is, the reasoner assumes that *H* doesn’t change over time. This corresponds to a special case of filtering in which
for some kernel \(\phi \) that we also call the model.

While \(\kappa \) can be seen as a model of the environment’s dynamics, \(\phi \) has more of the character of a statistical model. It is a model of how the agent’s sensor values depend on the unknown value of the hidden parameter *H*. However, we do not put any constraints on the hypothesis space *H* or the model \(\phi \). In particular, we do not assume that \(\phi \) is an injective function \(H\rightarrow P(S)\), and we allow the case where *H* is a finite set.

In the case of inference rather than filtering, the kernels \(\psi _S\) and \(\psi _{S,H'}\) from Eqs. (3) and (4) can be written

and

We write \(\psi _{S,H}\) instead of \(\psi _{S,H'}\) because in the i.i.d. inference case there is only one hidden variable, that is, \(H'=H\). Thus, the joint distribution \(\psi _{S,H}(y)\) can be seen as the reasoner’s joint belief about its next input and the hidden variable *H*, when its underlying machine is in state *y*. The consistency equation for Bayesian inference, Eq. (6), then follows by substituting these for \(\psi _{S,H'}\) and \(\psi _S\) in Eq. (5), the consistency equation for Bayesian filtering interpretations.

As with Bayesian filtering interpretations, it is useful to consider the case in which the underlying machine is deterministic (but not necessarily discrete). In Proposition 1 we gave a simpler version of the consistency equation for Bayesian filtering interpretations, which is equivalent to Definition 2 in the case of a deterministic machine. In the inference case we can substitute Eqs. (64) and (65) into this simplified consistency equation (Eq. (52)) to obtain

This is exactly the equation given by [24, Eq. 16] as a definition of a conjugate prior.

Both sides of Eq. (66) express a joint distribution between *S* and *H*, as a function of *Y*. In the context of conjugate priors, \(\phi \) is considered to be a family of distributions, with parameters *H*. Our interpretation map \(\psi _H\) corresponds to another family of distributions, which is a conjugate prior to \(\phi \). The machine state *Y* corresponds to the so-called hyperparameters, i.e. the parameters of \(\psi _H\).

This shift in perspective makes sense. In a computational context, conjugate priors are often useful precisely because they offer a way to perform inference without needing to directly calculate Bayesian inverses at run-time. Instead, the implementation only needs to keep track of the hyperparameters and update them in response to data. This update takes place according to a deterministic function, whose form depends on the family \(\phi \) and its conjugate prior \(\psi _H\). This updating of the hyperparameters is the role played by \(\gamma \): it takes in a data point in *S* along with the current value of the hyperparameters, and returns the updated hyperparameters. Equation (66) asserts that this must be done in such a way that the new value of *Y* does indeed correspond to the correct Bayesian posterior, when mapped to a distribution over *H* by the kernel \(\psi _H\).

We note that it is somewhat nontrivial to find a pair of kernels \(\psi _H\), \(\phi \) and a function \(\gamma \) such that Eq. (66) is obeyed. However, many such examples are known. (Although it is not an authoritative source, a useful list can be found online [40, under “Table of conjugate distributions”], which explicitly gives both kernels and the update function for each example.) Any example of a conjugate prior can be seen as a deterministic machine together with a consistent Bayesian inference interpretation. In addition, in Appendix C we give a number of examples of a different flavour, in that in our examples *H* is either a finite or a countable set.

### 1.4 B.4 Unpacking Bayesian Inference Interpretations

We now unpack Definition 3 by converting Eq. (6) into more familiar terms in the case where all the spaces are discrete sets, as we did for filtering interpretations in Appendix B.1.

In the case where *Y*, *H* and *S* are finite sets, Eq. (6) can be written as

or equivalently,

since we can cancel the machine’s transition probability if we assume it is positive. For this probability to be positive means that it is possible for the machine to transition from state \(y\in Y\) to state \(y'\in Y\) after receiving the input \(s\in S\).

We can now give an intuitive interpretation to the terms in this equation. If the machine starts in state *y*, receives input *s*, and transitions to state \(y'\) as a result, then we can regard the probability that \(\psi _H(y)\) assigns to *h* as the reasoner’s prior belief about the hypothesis *h*, the probability that \(\psi _S(y)\) assigns to *s* as its prior belief about the input *s*, and the probability that \(\psi _H(y')\) assigns to *h* as the reasoner’s posterior belief about the hypothesis *h*. Equation (68) can then be compared, term by term, to the much more familiar equation

Here we have written \(p(s\mid h)\) in place of the probability that \(\phi \) assigns to *s* given *h*, and \(p(h\mid s)\) in place of the probability that \(\psi _H(y')\) assigns to *h*, in order to emphasise the similarity to Bayes’ theorem in a more familiar form. Our definition, in the form of Eq. (6) or Eq. (67), differs from this in that it explicitly takes account of the machine’s state, and \(\phi \) and \(\psi _H\) are defined by Markov kernels rather than conditional probabilities.

We note that, as in the case of filtering (Appendix B.1), our definition of a consistent Bayesian inference interpretation allows the posterior to be arbitrary in the case of subjectively impossible inputs, i.e. those \(s\in S\) to which \(\psi _S(y)\) assigns probability zero for a given state \(y\in Y\). Given such an input the reasoner may update its posterior to anything at all. As with filtering, we regard this as the minimal assumption we could have made, but we can imagine several other choices that one could make instead. These include allowing the posterior to be undefined in such cases; *requiring* it to be undefined; requiring it to obey some additional consistency equation such that the posterior would make sense even on subjectively impossible inputs; or requiring \(\phi \) and \(\psi _H\) to be such that subjectively impossible inputs do not exist. We would consider these to be subtly different kinds of interpretation, and we leave their further investigation to future work.

### C Details of Examples

### 1.1 C.1 An Interpretation of a Non-Deterministic Finite Machine

We here present a non-deterministic finite machine with internal state space \(Y = \{ y_0, y_1, y_2 \}\) and sensory input space \(S = \{ s_1, s_2 \}\). One Bayesian interpretation of this machine, for a hidden state space \(H = \{ h_1, h_2 \}\), is as follows (where \(\delta \) is the Kronecker delta):

Under this interpretation, the model \(\phi \) ascribed to the machine is that sensory inputs transparently reflect the hidden state. The machine, in internal state \(y_0\), is taken to be uncertain about the hidden state; in state \(y_i \in \{ y_1, y_2 \}\) it is taken to be certain that the hidden state is \(h_i\). The dynamics of the machine match this interpretation: it transitions deterministically to \(y_i\) when receiving input \(s_i\), unless \(s_i\) is “subjectively impossible” (\(s_2\) at \(y_1\), and \(s_1\) at \(y_2\)). Behaviour on subjectively impossible inputs is not constrained by the consistency equation, so this is a consistent Bayesian interpretation.

### 1.2 C.2 Machine Counting Occurrences of Different Observations

We now consider a countably infinite deterministic machine \((Y_0,S,\gamma _0)\). Let \(Y_0=\mathbb {N}^+\times \mathbb {N}^+\) (\(\mathbb {N}^+\) excludes 0) and the input space be \(S = \{ +1, -1 \}\). The machine deterministically computes the function \(f_0: (Y_0 \times S) \rightarrow Y_0\), in the sense that, given a state and an input, its next state is the value of \(f_0\) with probability one. Essentially, it keeps separate counts of how many \(+1\) and \(-1\) inputs it has received. Formally:

One consistent Bayesian interpretation \((\psi _0,\phi _0)\) for machine \(\gamma _0\) uses hypothesis space \(H_0 = [0, 1]\) and model:

where

This model is known as the categorical distribution for two outcomes (or just the Bernoulli distribution). The machine states were deliberately chosen to be the hyperparameters of a possible interpretation map \(\psi _0:Y_0 \rightarrow H_0\) which is known as the Dirichlet distribution (and in this special case also as the Beta distribution):

where *B*(*i*, *j*) is the Beta function.

This interpretation map (the Dirichlet distribution) is the conjugate prior for categorical distributions. This implies that \((\psi _0,\phi _0)\) form a consistent Bayesian inference interpretation, as explained in Appendix B.3.
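The conjugacy claim can be checked numerically. The sketch below makes several assumptions, since the defining formulas are not reproduced here: that an input of \(+1\) increments the first count and \(-1\) the second, that the model assigns probability \(\theta \) to \(+1\) and \(1-\theta \) to \(-1\), and that state \((i,j)\) is mapped to the Beta density with parameters \(i\) and \(j\). All function names are hypothetical. It tests the deterministic consistency condition pointwise: prior density times likelihood equals predictive probability times the density at the updated state.

```python
import math


def beta_pdf(theta, i, j):
    """Beta(i, j) density at theta, computed via log-gamma for stability."""
    log_b = math.lgamma(i) + math.lgamma(j) - math.lgamma(i + j)
    return math.exp((i - 1) * math.log(theta) + (j - 1) * math.log(1 - theta) - log_b)


def f0(state, s):
    """Assumed counting update: +1 increments the first count, -1 the second."""
    i, j = state
    return (i + 1, j) if s == +1 else (i, j + 1)


def model(theta, s):
    """Assumed Bernoulli model: probability of input s given parameter theta."""
    return theta if s == +1 else 1.0 - theta


def predictive(state, s):
    """Probability of the next input under the Beta(i, j) belief."""
    i, j = state
    return i / (i + j) if s == +1 else j / (i + j)


def consistency_gap(state, s, theta):
    """|prior * likelihood - predictive * posterior| at one point."""
    lhs = beta_pdf(theta, *state) * model(theta, s)
    rhs = predictive(state, s) * beta_pdf(theta, *f0(state, s))
    return abs(lhs - rhs)


worst_gap = max(
    consistency_gap((i, j), s, theta)
    for i in range(1, 6)
    for j in range(1, 6)
    for s in (+1, -1)
    for theta in (0.1, 0.3, 0.5, 0.7, 0.9)
)
```

The gap vanishes up to floating-point error because \(B(i+1,j) = B(i,j)\, i/(i+j)\), which is the identity behind the hyperparameter update.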

### 1.3 C.3 Machine Tracking Differences Between the Number of Occurrences of Different Observations

We now consider another countably infinite deterministic machine \((Y_1,S,\gamma _1)\) which has the same input space as the machine in Appendix C.2. Let \(Y_1=\mathbb {Z}\) and the input space again be \(S = \{ +1, -1 \}\). The machine \(\gamma _1\) deterministically computes a function \(f_1: (Y_1 \times S) \rightarrow Y_1\), in the sense that, given a state and an input, its next state is the value of \(f_1\) with probability one. Machine \(\gamma _1\) only counts how many more \(+1\) inputs it has received than \(-1\) inputs. Formally:

One consistent Bayesian interpretation \((\psi _1,\phi _1)\) for machine \(\gamma _1\) uses hypothesis space \(H_1 = \{ h_{+1}, h_{-1} \}\) and model:

The interpretation map \(\psi _1:Y_1 \rightarrow H_1\) is

It is relatively easy to verify that \((\psi _1,\phi _1)\) is a consistent Bayesian interpretation of \(\gamma _1\).

As a teaser for future work we may note the following. Since machine \(\gamma _0\) of Appendix C.2 stores the individual counts for \(+1\) and \(-1\) inputs, it also implicitly keeps track of the difference between those counts; \(\gamma _1\) only keeps track of this difference. Consequently, we can define a deterministic kernel \(g:Y_0 \rightarrow P(Y_1)\) that maps any state \((i, j) \in Y_0\) of \(\gamma _0\) to \(g(i,j) := \delta _{i-j}\) which is a probability measure over the state space of \(\gamma _1\). It turns out that for this map, for any \(k' \in Y_1,s \in S\) and \((i,j) \in Y_0\) we have

This implies that we can construct an interpretation of machine \(\gamma _0\) from the interpretation \((\psi _1,\phi _1)\) of \(\gamma _1\). For this we precompose the interpretation map \(\psi _1\) for \(\gamma _1\) with the machine map *g* to get a consistent Bayesian inference interpretation for \(\gamma _0\). In future work we intend to further develop the theory of how a consistent interpretation of one deterministic machine can be “pulled back” to other machines that are related in a similar way to Eq. (78).
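The relation between the two machines can be checked on a finite range of states. In the sketch below the update functions are assumptions read off from the prose (the counting machine increments one of two counts, the difference machine adds the input to its state), and `pull_back` is a hypothetical helper that precomposes an interpretation map with the state map. The check is that updating and then mapping gives the same result as mapping and then updating.

```python
def f0(state, s):
    """Assumed update of the counting machine: +1 increments i, -1 increments j."""
    i, j = state
    return (i + 1, j) if s == +1 else (i, j + 1)


def f1(k, s):
    """Assumed update of the difference machine: add the input to the state."""
    return k + s


def g(state):
    """State map from the counting machine to the difference machine."""
    i, j = state
    return i - j


def commutes(max_count=20):
    """Check g(f0(y, s)) == f1(g(y), s) for all states up to max_count."""
    return all(
        g(f0((i, j), s)) == f1(g((i, j)), s)
        for i in range(1, max_count + 1)
        for j in range(1, max_count + 1)
        for s in (+1, -1)
    )


def pull_back(psi_1):
    """Turn an interpretation map on differences into one on pairs of counts."""
    return lambda state: psi_1(g(state))


square_commutes = commutes()
```

Any interpretation map for the difference machine, passed as a function of the integer state, can be supplied to `pull_back`; the commuting check is what makes the result track the counting machine's updates.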

### D Details on the Relation to the FEP

We here try to identify the structures in the FEP that are analogous to the notions of machine \(\gamma \), model \(\kappa \), and interpretation map \(\psi _H\). This suggests that, at least in some treatments of FEP, there is an implicit concept that is close to what we have called a reasoner. We will call this putative concept the FEP reasoner.

Large parts of the FEP literature do not explicitly deal with FEP reasoners but are sometimes presented as based on them (e.g. in [20]). The parts that construct the FEP reasoner are those called “Bayesian mechanics” and are still evolving. A standard reference is [19] but this is known to contain some issues [1, 9, 21]. The most recent version can be found in [16].

Understanding more precisely the relationship between our concept of a Bayesian reasoner and the FEP reasoner is left to future work. The following are preliminary observations.

### 1.1 D.1 Machine

We first identify the structure in the FEP setup that is most closely related to a machine and is also said to appear to perform Bayesian inference. Unfortunately, the latest iteration of the conditions under which there exists an FEP reasoner, which is [16], does not make this particular structure as explicit as the previous version [19]. We will therefore identify this structure in the older version. A corresponding structure should also exist in the newer version and we will hint at how it may differ.

The FEP setup in [19] consists of four sets of variables \(\eta \in E,s \in S,a \in A,\mu \in M\) called external, sensory, active, and internal states with *E*, *S*, *A*, *M* finite dimensional real vector spaces. These variables obey the stochastic differential equations

where \(\omega _\eta ,\omega _s,\omega _a,\omega _\mu \) are independent Gaussian noise terms. The FEP goes beyond the scope of a reasoner and formulates a concept of agent. The concept of an agent should, as part of its interpretation, make it possible to talk about deliberate actions. In the FEP deliberate actions are associated to the active states *a*. At the same time, the internal states are only involved in inference (or filtering) and the special case where there are no active states seems to be within the scope of the FEP. This should still leave us with a FEP reasoner and make it more comparable to our Bayesian reasoner. We therefore consider the special case where there are no active states such that we get:

This looks like a continuous time version of the Bayesian network in Eq. (1) and has the somewhat significant feature that all influences from the external states \(\eta \) are mediated by the sensory states *s*. This suggests that it is possible to see the sensory states \(s \in S\) as inputs to a machine state \(\mu \in M\) with the external states \(\eta \in E\) “hidden behind” the sensory states.

The internal states \(\mu \in M\) are supposed to appear to infer the external states. So the state space *Y* of the machine of the FEP reasoner should be identified with *M*. Going by their name and their role in the earlier dynamics of Eq. (79) it seems reasonable to identify the sensory state space *S* with the input state space (also *S* in our notation) of the machine.

This brings us to the machine’s kernel \(\gamma \). Our formalism does not deal with continuous-time kernels at the moment so we only make some informal comments here. Note that none of the following statements should be considered as proven. Since all variables together form a (time-homogeneous) Markov process, we can choose times \(t,t+\tau \) with \(\tau >0\) and write the conditional probability density (assuming things are well behaved enough) at a state \((\eta ',s',\mu ')\) at \(t+\tau \) given a state \((\eta ,s,\mu )\) at time *t* as \(p(\eta ',s',\mu ',t+\tau \mid \eta ,s,\mu ,t)\) (this notation is taken from [35, p.31]). We can then marginalise out \(\eta '\) and \(s'\) to get \(p(\mu ',t+\tau \mid \eta ,s,\mu ,t)\) which looks a bit closer to a machine kernel but still depends also on \(\eta \). We cannot just drop \(\eta \) from this expression even if we assume Eq. (80) holds since within a time interval \([t,t+\tau ]\) with \(\tau >0\) the influence from \(\eta \) would propagate through the intermediate values of the sensory states to \(\mu '\). Instead we here condition on all those intermediate values of the sensory state between *t* and \(t+\tau \). Write \(s[t:t+\tau ]\) for a part of the trajectory of the sensory state between *t* and \(t+\tau \) that starts in *s*. Then, assuming Eq. (80) we should get:

In order to make this look even more like a kernel we may take the limit as \(\tau \rightarrow 0\) and so we write

which is just a notation for an expression that hopefully provides sufficient intuition for our purposes.

What is important is that within the system Eq. (80) there should be a (continuous-time) machine describing the dynamics of the internal states in response to sensory states.

In [16] the structure of Eq. (79), and thus Eq. (80), is not stated explicitly. However, the sensory (and usually the active) states are still special due to an additional assumption, which is also made in [19, 34]. The larger process has to have a stationary distribution \(p(\eta ,s,a,\mu )\) that factorises according to

which is referred to as a Markov blanket. With this assumption alone, one can no longer assume that the sensory states \(s[t:t+\tau ]\) “shield” the internal states from direct influence by external states, which makes it more difficult to compare the dynamics to our setup. A solution may be to use a continuous-time version of the approach in [36]. Below we ignore this issue and assume that we have the structure of Eq. (80).
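For orientation, a Markov blanket condition of this kind is usually stated as conditional independence of the external and internal states given the blanket (sensory and active) states. A sketch of that standard form, not necessarily in the notation of [16], is:

\[
p(\eta ,s,a,\mu ) \;=\; p(\eta \mid s,a)\, p(\mu \mid s,a)\, p(s,a)
\]

This constrains only the stationary distribution. By itself it says nothing about which variables directly drive which others in the dynamics, which is why the shielding property has to be assumed separately.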

### D.2 Model

For a reasoner we also need a model and an interpretation map. As already mentioned, the FEP assumes that the system in Eq. (79) has a stationary distribution \(p(\eta ,s,\mu )\). One purpose of this assumption seems to be to define what we call the model. In the language of the FEP literature, the stationary distribution defines the generative model, where “generative model” refers to a joint probability distribution over causes (parameters/hidden variables) and observed variables. In [16, Section 3.b] the generative model is defined to be \(p(\eta ,s,\mu )\), with \(\eta \) as the hidden variables and \((s,\mu )\) as the observed variables. This could mean that the machine state \(\mu \) itself is also modelled by an FEP reasoner, which differs from our framework and would need further investigation that we leave for future work. We therefore resort to an earlier version in which only the marginalised stationary distribution \(p(\eta ,s)\) was considered as the generative model ([34, Fig. 3], [19, p.101]). In that case the hidden variable space *H* in our notation should be identified with the external state space *E*, and the model (in our sense) is a conditional distribution induced by the stationary distribution:

Note that this choice of a model does not by itself tell us whether the FEP reasoner does filtering or just inference in the sense of Definition 3. Such a model can be part of a filtering kernel \(\kappa \) as well. In both cases we also need an interpretation map.
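To make the model of this subsection concrete, the following sketch computes a conditional distribution \(p(s\mid \eta )\) from a joint stationary table \(p(\eta ,s)\) in a finite-state setting. The finite state spaces, the array layout, the function name `model_from_stationary`, and the example numbers are all assumptions made for illustration; the paper itself works with general Markov kernels.

```python
import numpy as np


def model_from_stationary(p_eta_s):
    """Return p(s | eta) as a row-stochastic matrix, given a joint table p(eta, s).

    Rows index external states eta, columns index sensory states s.
    Assumes every eta has positive stationary probability.
    """
    p_eta_s = np.asarray(p_eta_s, dtype=float)
    p_eta = p_eta_s.sum(axis=1, keepdims=True)  # marginal p(eta)
    if np.any(p_eta <= 0):
        raise ValueError("p(eta) must be positive for every eta")
    return p_eta_s / p_eta  # conditional p(s | eta), one row per eta


# Hypothetical stationary joint over 2 external and 3 sensory states.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.15, 0.15]])
kappa = model_from_stationary(joint)
```

Each row of `kappa` is a probability distribution over sensory states, so the array can be read as a kernel from external states to distributions over sensory states.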

### D.3 Interpretation map

For the interpretation map \(\psi _H\) we need a kernel of type \(M \rightarrow P(E)\). A kernel of the right type can indeed be identified in the FEP literature. It is denoted \(q_\mu (\eta )\), and we will identify it with \(\psi _H\). The kernel’s definition, however, relies on another assumption of the FEP, namely the existence of a “synchronisation map” \(\sigma :M \rightarrow E\). To construct \(\sigma \), let us first define two other functions \(g_M:S \rightarrow M\) and \(g_E:S \rightarrow E\) via

and then set

which is assumed to be well defined. For details on when this exists in the linear case see [1, 16]. With this we can define \(q_\mu (\eta )\) and in turn the interpretation map \(\psi _H\). This maps an internal state \(\mu \) to the Gaussian distribution with mean value equal to \(\sigma (\mu )\):

where the variance \(\varSigma (\mu )\) is defined as the variance of the best Gaussian approximation to the model \(p(s|\eta =\sigma (\mu ))\) when the external state is equal to \(\sigma (\mu )\) [34, Eq. 2.4]. Note that in [16] the whole stationary distribution is assumed to be Gaussian, and so \(p(\eta |\mu )\) in the corresponding equation in that publication (i.e. Eq. 3.3) is also Gaussian.
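As an illustration of the synchronisation map in the linear Gaussian setting referred to above, the sketch below assumes a zero-mean Gaussian stationary distribution over \((\eta ,s,\mu )\). It takes \(g_E\) and \(g_M\) to be the conditional means of \(\eta \) and \(\mu \) given \(s\), and sets \(\sigma = g_E \circ g_M^{-1}\). These choices, the function names, and the example covariance are assumptions made for illustration and are not taken from [1, 16]. Only the mean of the interpretation map is computed; the variance \(\varSigma (\mu )\) is not.

```python
import numpy as np


def conditional_mean_map(cov, target, given):
    """Matrix A with E[x_target | x_given] = A @ x_given (zero-mean Gaussian)."""
    c_tg = cov[np.ix_(target, given)]  # cross-covariance block
    c_gg = cov[np.ix_(given, given)]   # covariance of the conditioning block
    return c_tg @ np.linalg.inv(c_gg)


def synchronisation_matrix(cov, eta_idx, s_idx, mu_idx):
    """Linear map sigma with sigma(mu) = g_E(g_M^{-1}(mu)).

    Assumes g_M is square and invertible, i.e. S and M have equal dimension.
    """
    g_E = conditional_mean_map(cov, eta_idx, s_idx)  # s -> E[eta | s]
    g_M = conditional_mean_map(cov, mu_idx, s_idx)   # s -> E[mu | s]
    return g_E @ np.linalg.inv(g_M)


# Hypothetical covariance in the order (eta, s, mu), built from
# s ~ N(0, 1), eta = 2 s + unit noise, mu = 0.5 s + unit noise,
# so that eta and mu are conditionally independent given s.
cov = np.array([[5.0, 2.0, 1.0],
                [2.0, 1.0, 0.5],
                [1.0, 0.5, 1.25]])
sigma = synchronisation_matrix(cov, eta_idx=[0], s_idx=[1], mu_idx=[2])
```

In this toy covariance \(g_E(s) = 2s\) and \(g_M(s) = 0.5s\), so the sketch yields \(\sigma (\mu ) = 4\mu \). If \(g_M\) is not invertible, the composition is undefined, which corresponds to the requirement in the text that \(\sigma \) be well defined.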

In conclusion, the necessary ingredients for something like a Bayesian reasoner seem to be present in the FEP literature. One thing that is special about the FEP reasoner is that its model \(\kappa \) and interpretation map \(\psi _H\) are derived from features of the process that the machine is embedded in.

We do not know whether there is an appropriate notion of consistency equation that the FEP reasoner obeys. Presumably, instead of the equation for exact inference that we have presented, such an equation would express the idea that the FEP reasoner performs approximate inference in the form of free energy minimisation. Other differences are that the FEP takes place in continuous time, and perhaps more significantly, that it deals with deliberate actions as well as inference. However, it is not inconceivable that these could be expressed in the form of a consistency equation.

In the current formulations of the FEP, the interpretation is derived from the properties of the ‘true’ environment, such as the stationary distribution, or the synchronisation map \(\sigma \). In our consistency equation approach, this need not be the case, since a reasoner’s beliefs only need to be consistent and need not be correct. This means in particular that no stationarity assumption is needed.

Nonetheless, perhaps an important idea behind the FEP is that the model that most closely corresponds to the true environment can be considered the best one. A consistency equation approach would still be helpful, in order to systematically explore whether and how interpretations should relate to the larger process in which the machine is embedded.

## Copyright information

© 2021 Springer Nature Switzerland AG

## About this paper

### Cite this paper

Virgo, N., Biehl, M., McGregor, S. (2021). Interpreting Dynamical Systems as Bayesian Reasoners.
In: , *et al.* Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021. Communications in Computer and Information Science, vol 1524. Springer, Cham. https://doi.org/10.1007/978-3-030-93736-2_52

Publisher Name: Springer, Cham

Print ISBN: 978-3-030-93735-5

Online ISBN: 978-3-030-93736-2

eBook Packages: Computer Science, Computer Science (R0)