Confounding Ghost Channels and Causality: A New Approach to Causal Information Flows

Information theory provides a fundamental framework for the quantification of information flows through channels, formally Markov kernels. However, quantities such as mutual information and conditional mutual information do not necessarily reflect the causal nature of such flows. We argue that this is often the result of conditioning based on σ-algebras that are not associated with the given channels. We propose a version of the (conditional) mutual information based on families of σ-algebras that are coupled with the underlying channel. This leads to filtrations which allow us to prove a corresponding causal chain rule as a basic requirement within the presented approach.

where p(z) = P(Z = z) denotes the distribution of Z. (Throughout this introduction, we consider only variables X, Y, and Z with finite state sets X, Y, and Z, respectively.) Shannon entropy serves as a building block of further important quantities. The flow of information from a sender X to a receiver Z, for instance, can be quantified as the reduction of uncertainty about the outcome of Z based on the outcome of X. More precisely, we compare two uncertainties here: the uncertainty about the outcome of Z, that is H(Z), with the uncertainty about the outcome of Z after knowing the outcome of X, that is H(Z|X) = −∑_x p(x) ∑_z p(z|x) log p(z|x), where p(z|x) = P(Z = z|X = x) denotes the conditional distribution of Z given X. Naturally, the latter uncertainty is smaller than or equal to the first one, leading to another fundamental quantity of information theory, the mutual information I(X; Z) = H(Z) − H(Z|X). This difference can also be expressed in geometric terms as the KL-divergence of the joint distribution p(x, z) from the product p(x) p(z) of its marginals. The KL-divergence plays an important role in information geometry as a canonical divergence [AN00, Ama16, AJVLS17, AA15]. Such a divergence is characterised in terms of natural geometric properties. It is remarkable that this purely geometric approach yields the fundamental information-theoretic quantities which were previously derived from a set of axioms formulated in non-geometric terms.
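The equality between the uncertainty-reduction form and the KL-divergence form of the mutual information can be checked numerically. The following sketch uses a small hypothetical joint distribution p(x, z) over binary variables; the table values are illustrative, not taken from the text.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; 0·log 0 is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# Hypothetical joint distribution p(x, z) over binary X and Z.
p_xz = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xz.sum(axis=1)          # marginal p(x)
p_z = p_xz.sum(axis=0)          # marginal p(z)

H_Z = entropy(p_z)

# Conditional entropy H(Z|X) = sum_x p(x) H(Z | X=x).
H_Z_given_X = sum(p_x[i] * entropy(p_xz[i] / p_x[i]) for i in range(2))

# Mutual information as reduction of uncertainty ...
I_red = H_Z - H_Z_given_X
# ... and as KL-divergence of p(x, z) from p(x) p(z).
I_kl = sum(p_xz[i, j] * np.log2(p_xz[i, j] / (p_x[i] * p_z[j]))
           for i in range(2) for j in range(2))

assert abs(I_red - I_kl) < 1e-12
```

Both expressions agree term by term, which is exactly the identity I(X; Z) = H(Z) − H(Z|X) = D(p(x, z) ‖ p(x) p(z)).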
Typically, the conditional distribution p(z|x) is interpreted mechanistically as a channel which receives x as an input and generates z as an output. In this interpretation, the stochasticity of a channel is considered to be the effect of external or hidden disturbances of a deterministic map. This is formalised in terms of a so-called structural equation Z = f(X, U), with a deterministic map f and a noise variable U that is independent of X [Pea00]. With the representation (5), the conditional probability distribution can be interpreted as a (probabilistic) causal effect of X on Z. This interpretation provides the basis for Pearl's influential proposal of a general theory of causality [Pea00]. The mutual information (3) then becomes a measure of the causal information flow from X to Z [AP08], which is consistent with Shannon's original idea of the amount of information transmitted through a channel [Sha48]. This consistency, however, is apparently violated when dealing with variations or extensions of the sender-receiver setting. We are now going to highlight instances of such inconsistency that will play an important role in this article.
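A structural equation of this form can be illustrated with a minimal simulation. The sketch below assumes, for concreteness, a binary symmetric channel in which the noise U flips the input with probability eps; the channel choice and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary symmetric channel: flip the input with probability eps.
eps = 0.2

def f(x, u):
    """Deterministic map of the structural equation Z = f(X, U)."""
    return x * u          # u = -1 flips the sign of x

n = 200_000
x = rng.choice([-1, 1], size=n)                        # arbitrary input law
u = rng.choice([-1, 1], size=n, p=[eps, 1 - eps])      # noise, independent of X
z = f(x, u)

# The empirical conditional distribution p(z|x) is reproduced by the
# deterministic map plus independent noise: P(Z = X) approximates 1 - eps.
print(np.mean(z == x))   # close to 0.8
```

The stochasticity of z given x comes entirely from the hidden noise U, matching the mechanistic reading of p(z|x) described above.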

Confounding ghost channels
The mutual information is symmetric, that is I(X; Z) = I(Z; X). Interpreting it as a measure of causal information flow, this symmetry suggests that we have the same amount of causal information flow in both directions, even though the channel goes from X to Z so that there cannot be any flow of information in the opposite direction. What is wrong here? This apparent problem, let us call it "the symmetry puzzle", can be resolved quite easily. We can revert the direction and compute the conditional distribution p(x|z) = p(x) p(z|x) / p(z), based on elementary rules of probability theory and without reference to any mechanisms. Furthermore, this conditional distribution can be mechanistically interpreted and represented in terms of a structural equation (5). (This is always possible for a given conditional distribution.) Such a representation introduces a hypothetical channel for generating the reverted conditional distribution p(x|z), a kind of "ghost" channel that is actually not there. The mutual information then quantifies the causal information flow of this hypothetical channel. The symmetry of the mutual information simply means that the actual causal information flow in forward direction equals the causal information flow of any hypothetical channel in backward direction that is capable of generating the conditional distribution p(x|z). The symmetry puzzle, however, is not the only apparent inconsistency between the (conditional) mutual information and causality. We are now going to highlight another problem, which is closely related to the symmetry puzzle but requires a deeper analysis for its solution.
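The construction of the backward "ghost" channel and the resulting symmetry can be made concrete in a finite sketch. The forward channel and input law below are illustrative choices, not taken from the text.

```python
import numpy as np

# Hypothetical forward channel p(z|x) and input law p(x), binary states.
p_x = np.array([0.7, 0.3])
p_z_given_x = np.array([[0.9, 0.1],     # rows: x, columns: z
                        [0.2, 0.8]])

p_xz = p_x[:, None] * p_z_given_x       # joint p(x, z)
p_z = p_xz.sum(axis=0)

# The "ghost" backward channel, obtained by Bayes' rule alone:
# p(x|z) = p(x) p(z|x) / p(z), with no mechanism behind it.
p_x_given_z = (p_xz / p_z).T            # rows: z, columns: x

def mutual_information(joint):
    px, pz = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / np.outer(px, pz)[nz]))

# Symmetry: the forward flow equals the flow of any hypothetical backward
# channel generating p(x|z), since both are computed from the same joint.
assert abs(mutual_information(p_xz) - mutual_information(p_xz.T)) < 1e-12
```

The backward kernel is a perfectly valid conditional distribution, but nothing in the probabilistic data distinguishes it from a mechanism; that is precisely the symmetry puzzle.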
We now assume that the channel receives x and y as inputs and generates z as an output. With the corresponding conditional distribution p(z|x, y) = P(Z = z|X = x, Y = y) we have the conditional mutual information of Y and Z given X. According to (6), the conditional mutual information compares the uncertainty about z given x, before and after observing the outcome y, reflected by the conditional probabilities p(z|x) and p(z|x, y), respectively. The representation (7) makes this comparison more explicit as a deviation of p(z|x, y) from p(z|x). Together with (4), we obtain the chain rule I(X, Y; Z) = I(X; Z) + I(Y; Z|X). For the computation of both terms, I(X; Z) and I(Y; Z|X), we have to evaluate the "reduced" conditional distribution p(z|x). It is obtained from the original one as p(z|x) = ∑_y p(y|x) p(z|x, y). This conditional distribution represents a second kind of hypothetical channel, a "ghost channel", which screens off the actual flow of information. It can be sensitive to information about x that is not necessarily employed by the original channel p(z|x, y). More precisely, given two states x, x′ that satisfy p(z|x, y) = p(z|x′, y) for all z and all y, we cannot expect p(z|x) = p(z|x′) for all z. This is a consequence of the coupling through p(y|x) on the RHS of (11). In the most extreme case, y is simply a deterministic map of x, so that the knowledge of y does not provide any additional information about z, that is p(z|x, y) = p(z|x). In the following example we study this case more explicitly and thereby highlight the inconsistency of the terms I(X; Z) and I(Y; Z|X) in (10) with the underlying causal structure. We will argue that the conditional distribution (11) has to be modified in order to allow for a causal interpretation.
Example 1. Consider three variables X, Y, Z with values −1 and +1, and assume that Z is obtained as a copy of Y, that is Z = Y. This means that all information required for the output Z is contained in Y. Intuitively, we would expect a measure of information flow to assign zero to the flow from X to Z and a positive value to the flow of information from Y to Z given X. This is, however, not what we get with the usual definitions of mutual information and conditional mutual information. The reason for that is the stochastic dependence of the inputs X and Y. To be more precise, let us assume that the input distribution is given as p(x, y) = e^{β x y} / ∑_{x′, y′ ∈ {±1}} e^{β x′ y′}, where the parameter β controls the coupling of the inputs. This implies p(x) = P(X = x) = 1/2 and p(y) = P(Y = y) = 1/2 for all x, y ∈ {±1}. We can decompose the full mutual information, as a measure of information flow from X and Y together to Z, as I_β(X, Y; Z) = I_β(Y; Z) + I_β(X; Z|Y) = log 2 + 0. (The subscript β indicates the dependence of the respective information-theoretic quantities on this parameter.) This is consistent with the intuition that Z is receiving all information from Y and no information from X. However, we observe an inconsistency if we decompose the full mutual information in a different way: I_β(X, Y; Z) = I_β(X; Z) + I_β(Y; Z|X). The two terms on the RHS of (15), as functions of β, are shown in Figure 1. In the limit β → +∞ the two inputs become completely correlated, with support (−1, −1) and (+1, +1). Correspondingly, for β → −∞ we have complete anti-correlation, and the support is (−1, +1) and (+1, −1). With (15), we obtain the decomposition (16). It gives the impression that Z is receiving all information from X and no information from Y. However, we know, by construction of this example, that this is not the case. The actual situation is better reflected by the decomposition (14). ♦ The problem highlighted in Example 1 can be resolved by an appropriate modification of the conditional probability (11). We are now going to outline this
modification, which will provide the main idea of this article. In a first step, let us assume that ȳ is fixed as an input to the channel. Which information about x does the channel then use for generating z? In order to qualitatively describe that information, we lump any two states x and x′ together whenever the channel cannot distinguish them, that is p(z|x, ȳ) = p(z|x′, ȳ) for all z. This defines a partition α_{X,ȳ} of the state set of X that depends on ȳ. In a second step, we consider the join of all these partitions, that is their coarsest common refinement: α_X := ⋁_{ȳ ∈ Y} α_{X,ȳ}.

Figure 1: The mutual information I_β(X; Z) and the conditional mutual information I_β(Y; Z|X) as functions of β. Even though the channel does not employ any information from X, the mutual information I_β(X; Z) converges to the maximal value for β → ∞.
The partition α_X represents a qualitative description of the information in X that is used by the channel p(z|x, y). Denote by A_x the set in α_X that contains x. When the channel receives x, in addition to y, then it does not "see" the full x but only the class A_x, and it is easy to verify p(z|x, y) = p(z|A_x, y). Therefore we replace the conditioning p(z|x) in the above formula (11) by p̃(z|x) := p(z|A_x). This shows that the new conditional distribution p̃(z|x) is obtained by averaging the previous one, p(z|x), according to the information that is actually used by the channel p(z|x, y). Now, replacing in (9) the conditional distribution p(z|x) by p̃(z|x) leads to a corresponding modification of the mutual information and the conditional mutual information. It is easy to see that the individual terms may change. However, the sum does not change, and we have the chain rule I(X, Y; Z) = I(X → Z) + I(Y → Z|X). With this new definition, we come back to Example 1. The channel defined by (12) does not use any information from X. Therefore, α_{X,ȳ} = {X} for all ȳ ∈ Y, which implies α_X = {X}. With formula (18) we obtain p̃(z|x) = p(z|X) = p(z), and therefore I(X → Z) = 0 and I(Y → Z|X) = log 2. If we compare this with (16) we see that the information is shifted from the first to the second term, which corresponds to the variable that has the actual causal effect on Z. On the other hand, in both cases the sum of the two contributions equals log 2, the full mutual information I(X, Y; Z).
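Example 1 can be verified numerically. The sketch below builds the coupled input distribution p(x, y) ∝ e^{βxy} with Z = Y, and checks that the two terms of decomposition (15) always sum to the full mutual information log 2 = 1 bit, while I_β(X; Z) grows with the coupling even though the channel ignores X.

```python
import numpy as np

def H(p):
    p = np.asarray(p, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def decomposition(beta):
    """Return (I_beta(X;Z), I_beta(Y;Z|X)) for Example 1."""
    s = np.array([-1, 1])
    # Input distribution p(x, y) proportional to exp(beta * x * y).
    w = np.exp(beta * np.outer(s, s))
    p_xy = w / w.sum()
    # Z is a copy of Y, so the joint of (X, Z) equals the joint of (X, Y).
    I_XZ = H(p_xy.sum(axis=1)) + H(p_xy.sum(axis=0)) - H(p_xy.ravel())
    # I(Y; Z|X) = H(Z|X) - H(Z|X,Y) = H(Y|X), since Z = Y deterministically.
    p_x = p_xy.sum(axis=1)
    I_YZ_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(2))
    return I_XZ, I_YZ_given_X

for beta in (0.0, 1.0, 5.0):
    a, b = decomposition(beta)
    # Chain rule: the two terms always sum to the full flow, 1 bit.
    assert abs(a + b - 1.0) < 1e-9
```

For β = 0 the first term vanishes, and for large β almost the entire bit is attributed to X by the classical decomposition, reproducing the behaviour shown in Figure 1.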
Causality plays an important role in time series analysis. In this context, Granger causality [Gra69, Gra80] has been the subject of extensive debates which tend to highlight its non-causal nature. Schreiber proposed an information-theoretic quantification of Granger causality, referred to as transfer entropy, which is based on conditional mutual information [Sch00, BBHL16]. Even though transfer entropy is an extremely useful and widely applied quantity, it is generally accepted that it has shortcomings as a measure of causal information flow. In particular, it can vanish in cases where the causal effect is the strongest possible. We argue that this is again a result of a ghost channel that is involved in the computation of the classical conditional mutual information and screens off the actual causal information flow. This is demonstrated in the following example, which is taken from [AP08]. Essentially, this example is a reformulation of Example 1, adjusted to the context of time series and stochastic processes.
Example 2 (Transfer entropy). Consider a stochastic process (X_m, Y_m), m = 1, 2, . . ., with state space {±1}² and define X^m := (X_1, . . ., X_m) and Y^m := (Y_1, . . ., Y_m). The transfer entropy at time m is defined as T(Y → X) := I(Y^{m−1}; X_m | X^{m−1}). Thus, the transfer entropy quantifies how much information the variables Y_1, . . ., Y_{m−1} contribute to the evaluation of X_m, in addition to the information in X_1, . . ., X_{m−1}. We assume that the process is a Markov chain, given by a transition matrix depending on a coupling parameter β, whose causal structure is such that both X_m and Y_m depend only on Y_{m−1}. We consider the process in its stationary regime. The transfer entropy can be upper bounded as follows (the subscript β indicates the dependence on the coupling parameter β): T_β(Y_{m−1} → X_m) ≤ I_β(Y_{m−1}; X_m | X_{m−1}). For β = 0, we have an i.i.d. process with uniform distribution over the states (+1, +1), (−1, +1), (+1, −1), and (−1, −1). For β → ∞, we obtain the deterministic transition (x, y) → (−y, −y).
In this limit, the variables (X_m, Y_m) are completely correlated, with p(+1, +1) = p(−1, −1) = 1/2. In both cases, β = 0 and β → ∞, the conditional mutual information I_β(Y_{m−1}; X_m|X_{m−1}), and therefore the transfer entropy T_β(Y_{m−1} → X_m), vanishes. For β = 0, this does not represent a problem because any measure of causal information flow should vanish in the i.i.d. case. However, for β → ∞, the variable X_m is causally determined by Y_{m−1}. Therefore, a measure of causal information flow should be maximal in this case. This is not reflected by the transfer entropy. Let us compare this with the information flow measure proposed in this article. Given that X_m only depends on Y_{m−1}, the partition (17) is trivial, that is α = {X}. Therefore, the causal information flow I_β(Y_{m−1} → X_m|X_{m−1}) coincides with the mutual information I_β(Y_{m−1}; X_m), which converges to the maximal value log 2 for β → ∞. For comparison, both functions are plotted in Figure 2. ♦ In what follows, we will extend the idea outlined in this section to the more general context of measurable spaces, probability measures, and Markov kernels. In further steps, we will also consider more input nodes.
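The deterministic limit of Example 2 can be checked directly. The sketch below encodes the limiting joint law of (X_{m−1}, Y_{m−1}, X_m) implied by the transition (x, y) → (−y, −y) and the stationary distribution p(+1, +1) = p(−1, −1) = 1/2, and verifies that the transfer entropy vanishes while the unconditional flow from Y_{m−1} to X_m is maximal.

```python
import numpy as np
from collections import defaultdict

def H(ps):
    ps = np.asarray(ps, float)
    nz = ps > 0
    return -np.sum(ps[nz] * np.log2(ps[nz]))

# Deterministic limit beta -> infinity: (x, y) -> (-y, -y), stationary
# distribution p(+1,+1) = p(-1,-1) = 1/2.  Joint law of (X_{m-1}, Y_{m-1}, X_m):
joint = {(1, 1, -1): 0.5, (-1, -1, 1): 0.5}

def marginal(keep):
    """Marginal distribution over the coordinates listed in `keep`."""
    m = defaultdict(float)
    for k, p in joint.items():
        m[tuple(k[i] for i in keep)] += p
    return np.array(list(m.values()))

# Transfer entropy T = I(Y_{m-1}; X_m | X_{m-1})
#   = H(X_{m-1}, X_m) + H(X_{m-1}, Y_{m-1}) - H(X_{m-1}) - H(all three).
T = (H(marginal([0, 2])) + H(marginal([0, 1]))
     - H(marginal([0])) - H(marginal([0, 1, 2])))
assert abs(T) < 1e-12    # vanishes, despite full causal determination

# Yet X_m is a deterministic function of Y_{m-1}: I(Y_{m-1}; X_m) = 1 bit.
I = H(marginal([1])) + H(marginal([2])) - H(marginal([1, 2]))
assert abs(I - 1.0) < 1e-12
```

Conditioning on X_{m−1}, which equals Y_{m−1} almost surely in this limit, screens off the entire causal flow; this is the ghost-channel effect in the time-series setting.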

General information-theoretic quantities
In the previous sections, we reviewed fundamental information-theoretic quantities as they are introduced in standard textbooks such as [CT06]. In this section, we offer an alternative review from a measure-theoretic perspective (see, for instance, [Kak]). This more abstract setting will allow us to identify natural operations and definitions which are not always visible when dealing with finite state spaces.
Shannon entropy. For a probability space (Ω, F, P) and a finite measurable partition γ of Ω, the entropy of γ is defined as H(γ) := −∑_{C∈γ} P(C) log P(C). Given a second finite partition α, the local conditional entropy h(γ|α)(ω) := −log P(C_ω|A_ω), where C_ω and A_ω denote the sets of γ and α containing ω, integrates to the conditional entropy H(γ|α). The function h(γ|α) can be generalised by replacing the partition α by an arbitrary σ-subalgebra A of F, where h(γ|A)(ω) := −log P(C_ω|A)(ω). Note that this function is only P-almost everywhere defined (abbreviated as P-a.e.). In the case where the σ-algebra A is given by a finite partition α with P(A) > 0 for all A ∈ α, we have h(γ|A) = h(γ|α) P-a.e. This shows that the definition (27) is indeed an extension of (26). Correspondingly, integrating (27) yields a generalisation of (25): H(γ|A) = −∑_{C∈γ} ∫_Ω P(C|A) log P(C|A) dP.
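On a finite probability space, the local function h(γ|α) and its integral can be computed explicitly. The partitions below are illustrative; the sketch checks that integrating the local conditional entropy recovers the standard conditional entropy formula.

```python
import numpy as np

# Finite sketch: Omega = {0,...,5} with uniform P; gamma, alpha are partitions.
P = np.full(6, 1 / 6)
gamma = [{0, 1, 2}, {3, 4, 5}]
alpha = [{0, 1}, {2, 3}, {4, 5}]

def prob(S):
    return sum(P[w] for w in S)

def h_local(omega):
    """h(gamma|alpha)(omega) = -log2 P(C_omega | A_omega)."""
    C = next(C for C in gamma if omega in C)
    A = next(A for A in alpha if omega in A)
    return -np.log2(prob(C & A) / prob(A))

# Integrating the local function over Omega ...
H_cond = sum(P[w] * h_local(w) for w in range(6))

# ... recovers H(gamma|alpha) = -sum_A P(A) sum_C P(C|A) log P(C|A).
H_direct = -sum(prob(A) * sum((prob(C & A) / prob(A))
                              * np.log2(prob(C & A) / prob(A))
                              for C in gamma if prob(C & A) > 0)
                for A in alpha)
assert abs(H_cond - H_direct) < 1e-12
```

Only the middle cell {2, 3} of α straddles the two cells of γ, so the conditional entropy here is P({2, 3}) · 1 bit = 1/3 bit.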
Mutual information. If we subtract from the entropy of a partition γ the conditional entropy of γ given a partition α, we obtain the mutual information I(α; γ) := H(γ) − H(γ|α). Let us relate this function to the corresponding local functions h(γ) and h(γ|α). Taking the difference, we obtain the local mutual information i(α; γ) := h(γ) − h(γ|α). If we evaluate this for ω ∈ Ω we obtain i(α; γ)(ω) = log [P(C_ω|A_ω) / P(C_ω)], and thus we have I(α; γ) = ∫_Ω i(α; γ) dP. For the general case where the partition α is replaced by a σ-subalgebra A of F, we obtain i(A; γ)(ω) = log [P(C_ω|A)(ω) / P(C_ω)]. This leads to a corresponding generalisation of (28).

Conditional mutual information. Finally, we define the conditional mutual information. With two σ-subalgebras A and B of F, we define the local function i(B; γ|A) as the logarithmic ratio of the conditional probabilities of C_ω given A ∨ B and given A. Integration of this function leads to I(B; γ|A) := ∫_Ω i(B; γ|A) dP. In a final step, we could further extend i(B; γ|A) and I(B; γ|A) to the case where γ is replaced by a σ-algebra C, by taking the supremum over all finite partitions γ in C. However, in this article we restrict attention to a fixed finite partition γ.
The chain rule as a guiding scheme

Two inputs
In the introduction, Section 1, we used the two-input case for discrete random variables in order to highlight the main issue with the classical definitions of the mutual information and the conditional mutual information, and to outline the core idea of this article. Having introduced the required information-theoretic quantities for more general variables in Section 3, we now revisit the instructive two-input case and demonstrate how measure-theoretic concepts come into play very naturally.
Consider measurable spaces (X, X), (Y, Y), (Z, Z), and their product (Ω, F) := (X × Y × Z, X ⊗ Y ⊗ Z). In order to ensure the existence of (regular versions of) various conditional distributions, we need to assume that these measurable spaces carry further structure. Typically, it is sufficient to require that (X, X), (Y, Y), and (Z, Z) are Polish spaces (see [Dud02], Theorem 13.1.1), which will be implicitly assumed hereinafter for all measurable spaces. Now, consider a probability measure µ on (X × Y, X ⊗ Y) and a Markov kernel ν which models a channel that takes two inputs, x ∈ X and y ∈ Y, and generates a possibly random output z ∈ Z. This allows us to define a probability measure P on the joint space (Ω, F). With the natural projections X : Ω → X, (x, y, z) ↦ x, Y : Ω → Y, (x, y, z) ↦ y, and Z : Ω → Z, (x, y, z) ↦ z, we obtain the corresponding marginals and, finally, the ν-push-forward measure µ* of µ. Note that the definition of the conditional distribution P(Z ∈ C|X = x, Y = y) on the RHS of (33) is quite general and does not exclude cases where P(X = x, Y = y) = 0. It requires a formalism that goes beyond the context of variables with finite state sets X, Y, and Z. It is important to outline this formalism in some detail here, as it will provide the basis for an appropriate definition of marginal channels. The definition of the conditional distribution involves two steps. Step 1. We interpret the indicator function 1_{Z∈C} as an element of the Hilbert space L²(Ω, F, P) and project it onto the (closed) linear subspace of (X, Y)-measurable functions Ω → R. Its projection is referred to as the conditional expectation and denoted by E(1_{Z∈C}|X, Y). Note that the elements of the Hilbert space L²(Ω, F, P) are equivalence classes of functions, where two functions are identified if they coincide on a measurable set of probability one. Therefore, the conditional expectation (38) is almost surely well defined.
Step 2. Formally, E(1_{Z∈C}|X, Y) is a real-valued function defined on Ω. On the other hand, it is (X, Y)-measurable, so that we should be able to interpret it as a function of x and y. Indeed, it follows from the factorisation lemma that there is a unique measurable function ϕ_C : X × Y → R with E(1_{Z∈C}|X, Y) = ϕ_C(X, Y). The conditional distribution (37) is then simply defined to be the function ϕ_C, which has x and y as arguments. In the special situation where we start with a Markov kernel ν, we recover it in terms of equation (33). It turns out that this equation already describes a quite general situation. Under mild conditions, assuming, for instance, that all measurable spaces are Polish spaces, the conditional distribution (37) can be considered to be a Markov kernel, as a function of x, y, and C.
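On a finite sample space, the L² projection of Step 1 has a very concrete form: the conditional expectation E(1_{Z∈C}|X, Y) is the average of the indicator within each (x, y)-fiber. The following sketch, with an illustrative channel, checks the defining orthogonality property of the projection empirically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite sketch: binary X, Y; Z generated by a hypothetical channel.
n = 100_000
x = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)
z = (rng.random(n) < 0.2 + 0.6 * (x == y)).astype(int)

ind = (z == 1).astype(float)        # the indicator 1_{Z in C} for C = {1}

# E(1_{Z in C} | X, Y): the (X,Y)-measurable function closest to the
# indicator in L2.  On finite spaces it is the average within each fiber.
cond_exp = np.zeros(n)
for a in (0, 1):
    for b in (0, 1):
        fiber = (x == a) & (y == b)
        cond_exp[fiber] = ind[fiber].mean()

# Orthogonality of the projection: the residual is orthogonal to every
# (X,Y)-measurable function, in particular to cond_exp itself.
residual = ind - cond_exp
assert abs(np.mean(residual * cond_exp)) < 1e-12
```

The factorisation lemma of Step 2 then simply says that cond_exp, being constant on each fiber, can be read as a function ϕ_C(x, y) of the inputs alone.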
For the definition of mutual information and conditional mutual information, as generalisations of (4) and (7), respectively, we have to find an appropriate notion of a marginal kernel. We begin with the conditional distribution P(Z ∈ C|X = x), as a generalisation of p(z|x). For its evaluation we repeat the arguments of the above two steps and consider the conditional expectation E(1_{Z∈C}|X). This is an X-measurable random variable Ω → R. By the factorisation lemma we have a unique measurable function ν_X(·; C) : (X, X) → R satisfying E(1_{Z∈C}|X) = ν_X(X; C), and we set P(Z ∈ C|X = x) := ν_X(x; C). Under mild conditions we can assume that ν_X(x; C) defines a Markov kernel when considered as a function ν_X : X × Z → [0, 1] in x and C.
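In the finite case, the marginal kernel ν_X has the familiar averaging form p(z|x) = ∑_y p(y|x) p(z|x, y) from formula (11). The sketch below, with illustrative numbers, checks that this averaging is consistent with conditioning the joint distribution induced by µ and ν.

```python
import numpy as np

# Finite sketch of the marginal kernel: nu_X(x; C) = sum_y p(y|x) nu(x, y; C).
p_xy = np.array([[0.3, 0.2],        # joint input distribution mu(x, y)
                 [0.1, 0.4]])
nu = np.array([[[0.9, 0.1],         # nu[x, y] = distribution of Z given (x, y)
                [0.5, 0.5]],
               [[0.5, 0.5],
                [0.2, 0.8]]])

p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

# Marginal channel obtained by averaging out y with the conditional p(y|x):
nu_X = np.einsum('xy,xyz->xz', p_y_given_x, nu)

# Consistency with the joint law: P(X=x, Z in C) = sum_y mu(x,y) nu(x,y;C),
# so conditioning the joint on X must reproduce nu_X.
p_xz = np.einsum('xy,xyz->xz', p_xy, nu)
assert np.allclose(nu_X, p_xz / p_x[:, None])
```

This is exactly the "ghost channel" of the introduction: ν_X depends on the input coupling p(y|x) and not only on the mechanism ν.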
We can now easily extend the classical definitions of mutual information and conditional mutual information to the context of this section. For a finite measurable partition γ of Z we can use (29) to define the mutual informations I_γ(X; Z) and I_γ(X, Y; Z). Furthermore, with (31) we define the conditional mutual information I_γ(Y; Z|X). For the causal modification, we consider the conditional expectation of 1_{Z∈C} with respect to the σ-algebra adjusted to the channel. This is, by definition, a random variable Ω → R. By the factorisation lemma, we can find a unique measurable function ν̃_X(x; C), as a modification of ν_X(x; C), which appears twice in (44). Note that the kernel ν̃_X(x; C) is defined almost surely. However, the definition of a conditional mutual information will be independent of the version of that kernel.
Now we come to the definition of a causal version of the mutual information (41) and the conditional mutual information (42). We simply replace in these definitions ν_X(x; C) by ν̃_X(x; C), yielding I_γ(X → Z) and I_γ(Y → Z|X). The following proposition relates the causal quantities (49) and (50) to the corresponding noncausal ones, (41) and (42).
Proposition 4. We have the chain rule I_γ(X, Y; Z) = I_γ(X → Z) + I_γ(Y → Z|X) (51). Furthermore, I_γ(X → Z) ≤ I_γ(X; Z) and I_γ(Y; Z|X) ≤ I_γ(Y → Z|X) (52). Proof. With C_z denoting the set in γ that contains z, we have a pointwise decomposition of the integrand. Integrating this with respect to ν(x, y; dz), and further integrating the first term of (53) with respect to µ, gives us I_γ(Y → Z|X) (see (50)). For the corresponding integration of the second term, we obtain I_γ(X → Z). The crucial step (55) follows from the general property of the conditional expectation of a function f with respect to a σ-subalgebra A: E(f g) = E(E(f|A) g) for every A-measurable function g. Here, f is given by P(Z ∈ C|X, Y), A is the σ-algebra generated by X, and g is given by log [ν̃_X(X; C)/P(Z ∈ C)], which is X-measurable. The steps (54) and (56) follow directly from the definitions of the Markov kernels, and we finally obtain I_γ(X → Z) (see (49)). This concludes the proof of the chain rule (51).
We now prove the inequalities (52), where we can restrict attention to the first one. We consider the convex function φ(r) := r log [r/µ*(C)] for r > 0, with φ(0) := 0, and apply Jensen's inequality for conditional expectations. This implies I_γ(X → Z) ≤ I_γ(X; Z). The second inequality in (52) follows from the first one and the chain rule (51).
Let us interpret this result. The first inequality in (52) highlights the fact that the stochastic dependence between X and Z, here quantified by the usual mutual information I_γ(X; Z), cannot be fully attributed to the causal effect of X on Z. Some part of I_γ(X; Z) is purely associational, and I_γ(X → Z) constitutes the causal part of it. The second inequality in (52) highlights a different fact. Conditioning on the variable X may "screen off" some part of the causal effect of Y on Z. More precisely, the uncertainty reduction about the outcome of Z through X can be so strong that a further reduction through Y becomes "invisible". Therefore, the classical conditional mutual information, I_γ(Y; Z|X), tends to reflect only part of the causal influence of Y on Z given X, quantified by I_γ(Y → Z|X). Even though the classical information-theoretic quantities are replaced by their causal versions, the full mutual information can still be decomposed according to the chain rule (51). However, in comparison to the decomposition (45), some amount of it is shifted from one term to the other, so that both terms can be interpreted causally.
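The causal chain rule and the vanishing of the spurious term can be verified in a finite sketch. The setup below is a noisy variant of Example 1 (the channel copies Y with illustrative noise and ignores X; all numbers are assumptions): the partition α_X is computed by lumping indistinguishable inputs, the modified kernel p̃(z|x) by averaging over the classes, and the two causal terms are checked against the full mutual information.

```python
import numpy as np

def kl_term(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

# Hypothetical channel nu[x, y] = p(.|x, y) that ignores X entirely,
# with strongly coupled inputs (cf. Example 1 with beta > 0).
nu = np.array([[[0.9, 0.1], [0.1, 0.9]],
               [[0.9, 0.1], [0.1, 0.9]]])    # identical for x = 0, 1
p_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])

p_x = p_xy.sum(1)
p_z = np.einsum('xy,xyz->z', p_xy, nu)
p_z_given_x = np.einsum('xy,xyz->xz', p_xy, nu) / p_x[:, None]

# Partition alpha_X: lump x, x' whenever the channel cannot distinguish them.
classes = []
for xi in range(2):
    for cl in classes:
        if np.allclose(nu[xi], nu[cl[0]]):
            cl.append(xi)
            break
    else:
        classes.append([xi])

# Modified kernel: average p(z|x) over the class A_x, weighted by p(x).
p_tilde = np.zeros_like(p_z_given_x)
for cl in classes:
    w = p_x[cl] / p_x[cl].sum()
    p_tilde[cl] = w @ p_z_given_x[cl]

I_causal_X = sum(p_x[xi] * kl_term(p_tilde[xi], p_z) for xi in range(2))
I_causal_Y = sum(p_xy[xi, yi] * kl_term(nu[xi, yi], p_tilde[xi])
                 for xi in range(2) for yi in range(2))
I_full = sum(p_xy[xi, yi] * kl_term(nu[xi, yi], p_z)
             for xi in range(2) for yi in range(2))

# Causal chain rule: the two causal terms decompose the full information.
assert abs(I_causal_X + I_causal_Y - I_full) < 1e-9
# The channel ignores X, so the causal flow from X vanishes.
assert abs(I_causal_X) < 1e-12
```

With the classical decomposition the coupling p(y|x) would attribute a large part of I_full to X; the adjusted conditioning shifts that amount to the term for Y, which is the variable with the actual causal effect.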
It turns out that the definitions (49) and (50) require a careful extension if we want to have a general chain rule for more than two input variables. We are now going to highlight this for three input variables.

Three inputs
We now consider three input variables. This will reveal that the previous case with two input variables is quite special. An extension to more than two variables requires an adjustment of our definition of causal information flow.
We consider a third input variable (denoted below by W) with values in a measurable space (W, W), a probability measure µ on the product of the input spaces, and an input-output channel, given by a Markov kernel ν. This gives rise to a probability space, consisting of the measurable space Ω = W × X × Y × Z, equipped with the product σ-algebra, and the probability measure P defined by µ and ν. Finally, we have the natural projections W : Ω → W, X : Ω → X, Y : Ω → Y, and Z : Ω → Z.
The definition of the marginal kernel ν̃_X(x; C), as introduced in Section 4.1, is directly applicable to the situation of three input variables. It allows us to define marginal channels by an appropriate grouping of two input variables into one input variable, which formally reduces the three-input case to a two-input case. In particular, we can define the channels ν̃_{W,X}(w, x; C) and ν̃_W(w; C) by grouping W, X and X, Y, respectively, into one variable.
An integration of the last term (60) with respect to µ yields, by the same reasoning as in the steps (54), (55), and (56), the flow I_γ(W → Z). A corresponding integration of the first term (58) with respect to µ yields a non-negative quantity that can be interpreted as I_γ(Y → Z|W, X) (see definition (50)). Even though we will have to slightly adjust this first term, the problem we are facing here is most clearly highlighted by the second term, (59). In order to naturally generalise the chain rule (51), we have to interpret the integral of the second term as I_γ(X → Z|W). However, it turns out that, in general, this interpretation fails, where (62) is the integral of the term (59) with respect to µ. We cannot even ensure that this integral is non-negative. The reason is that the σ-algebra used for the definition of ν̃_W(w; C) is not necessarily a σ-subalgebra of the one used for the definition of the kernel ν̃_{W,X}(w, x; C) (the situation is similar to that of Example 3). Therefore, the reasoning of the steps (54), (55), and (56) cannot be applied here.
The problem highlighted in this section will now be resolved by a modification of the involved σ-algebras, which should form a filtration in order to imply a general causal version of the chain rule. In the next section, this modification will be presented for the general case of n input variables.
where B_M equals the smallest σ-algebra, {∅, X_M}. In that case, we obtain ν̃(x_M; C) = µ*(C). An adjustment of B_M to the information actually used by ν will allow us to interpret ν̃ causally. In contrast, if we do not have such an adjustment, ν̃_M will represent a hypothetical channel, a "ghost channel", based on the σ-algebra of an external observer rather than the σ-algebra of the actual mechanisms of the channel.
We now consider a family B = (B_M)_{M⊆N} of σ-algebras. It gives rise to a corresponding family of σ-algebras on Ω, and we call the family B projective if the associated maps are compatible. For projective families, we then have a corresponding monotonicity. Given a projective family B, we now define a corresponding family of information-theoretic quantities which generalise the (conditional) mutual information. We begin with a local version, applied to a measurable partition γ of Z. For z ∈ Z, we denote the set in γ that contains z by C_z. This yields a local version of the conditional mutual information; integration over z yields the corresponding information flow. As B(R) is generated by the intervals [r − ε, r + ε] ⊆ R, the smallest σ-algebra A^ν for which all functions ν(·; C) are measurable is generated by sets of the form A_{r,ε}. For a set M ⊆ N and a context configuration x = (x_i)_{i∈N\M} ∈ R^{N\M}, the (M, x)-section of A_{r,ε} is a halfspace. Therefore, the M-trace of A^ν, A^ν_M, is generated by halfspaces. For |M| = 1, we recover the half lines, so that A^ν_{{i}} = B(R). The projective extension then leads to the largest σ-algebra, the Borel algebra of R^M. Therefore, the marginal channel ν̃_M(x; C) equals the usual marginal ν_M(x; C) for the projective extension. For the projective reduction, on the other hand, we obtain the trivial σ-algebra except for M = N. In this case we have ν̃_M(x; C) = µ*(C) for M ≠ N and ν̃_N(x; C) = ν(x; C), where µ is the joint distribution of the input variables. We now consider the information flows associated with L ⊆ M ⊆ N, for the projective extension as well as for the projective reduction. In both cases these flows coincide with usual (conditional) mutual informations, in an instructive way. More precisely, for the extension the flows equal classical conditional mutual informations. For the reduction, we obtain a flow that vanishes for M ≠ N and equals the full mutual information for M = N. Interestingly, (77) does not depend on L.
The vanishing of the information flow for M ≠ N is due to the fact that the output of the channel, the sum x_1 + · · · + x_n, cannot be computed from a proper subset of the inputs. The flow of information only takes place if all inputs are given. ♦

Conclusions
Conditioning is an important operation within the study of causality. The theory of causal networks, pioneered by Pearl [Pea00], introduces interventional conditioning as an operation, the so-called do-operation, that is fundamentally different from the classical conditioning based on the general rule P(B|A) = P(A ∩ B)/P(A). It models experimental setups more appropriately and avoids confusion with purely associational dependencies. Information theory has classically been used for the quantification of such dependencies, in terms of mutual information and conditional mutual information [Sha48]. Within the original setting of information theory, the mutual information between the input and the output of a channel can be interpreted causally. In the more general context of causal networks, however, confounding effects make a distinction between associations and causal effects more difficult. In such cases, information-theoretic quantities can be misleading as measures of causal effects. In order to overcome this problem, information theory has been coupled with the interventional calculus of causal networks, and corresponding measures of causal information flow have been proposed [AK07, AP08]. Given that such measures are based on the notion of an experimental intervention, which represents a perturbation of the system, it remains unclear to what extent they quantify causal information flows in the unperturbed system. As another consequence of the interventional conditioning, one cannot expect that causal information flow, as defined in [AP08], decomposes according to a chain rule. The current article is based on an idea from 2003 which precedes the above-mentioned works on combining the theory of causal networks with information theory. It proposes a way to quantify causal information flows without perturbing the system through intervention. Instead, it is based on classical conditioning in terms of the conditional distribution P(B|A), where the σ-algebra A is adjusted to the
intrinsic mechanisms of the system. The derived information flow measure satisfies the chain rule and the natural properties of a general measure of causal strength postulated in [JBGWS13]. The chain rule, together with the generalised Pythagoras relation from information geometry, provides powerful tools for the study of the problem of partial information decomposition [BRO+14, LBJW18, APV20].
Even though the introduced information flows satisfy natural properties, the aim of the present article is relatively moderate. For instance, the analysis is focussed on a simple network consisting of a number of inputs and one output, which is a strong restriction compared to the setting of [AP08]. The extension of the present work to more general causal networks remains to be worked out. Furthermore, this article does not address the important problem of causal inference [PJS17]. In addition to these general directions of research, there are various ways to modify and extend the constructions of the present work and thereby potentially highlight further causal aspects of a given channel. The following perspectives are particularly important: 1. In the present article, the information flow has been defined for a fixed finite measurable partition γ of the state space (Z, Z) of the output variable Z. A natural further step would be to consider the limit of information flows with respect to an increasing sequence γ_n, n = 1, 2, . . ., so that ⋁_{n=1}^∞ σ(γ_n) = Z.
This limit will be an information flow measure that is independent of a particular partition.
2. Throughout this article, the partition γ has not been coupled with the σ-algebra of the channel ν, that is, the smallest σ-algebra for which all functions ν(x; C), C ∈ Z, are measurable. Given that the channel is analysed with respect to the partition γ, one can restrict attention to the smallest σ-algebra for which the functions ν(x; C), C ∈ γ, are measurable. This will be a potentially smaller σ-subalgebra of the one generated by the channel. We would then have a natural coupling of the partition γ with the information used by the channel.
3. We started with the family A^ν_M of M-traces of A^ν, the σ-algebra generated by ν, as the natural family associated with the channel. However, these traces do not form a projective family of σ-algebras. Such projectivity is required for the chain rule for the corresponding information flows. One can recover projectivity by extension and by reduction. Example 11 shows that the extension can lead to the largest σ-algebra and the reduction to the trivial one. Given this fact, one might ask whether the extension is too large and the reduction is too small to capture the causal aspects of ν. Even though we argued above that these two projective families associated with ν capture two different kinds of causal aspects, this question remains to be further pursued. One possible direction would be the analysis of the context-dependent traces of A^ν, that is, the family tr_{M,x}(A^ν), x ∈ X_{N\M}. Instead of conditioning with respect to the join tr_M(A^ν) = ⋁_{x∈X_{N\M}} tr_{M,x}(A^ν), one could adjust the conditioning to the individual σ-algebras tr_{M,x}(A^ν). This would represent an important refinement of the presented theory.

Figure 2: Dashed line: the conditional mutual information I β (Y m−1 ; X m |X m−1 ) as an upper bound of the transfer entropy T β (Y m−1 → X m ); solid line: the causal information flow I β (Y m−1 → X m |X m−1 ) which coincides with the mutual information I β (Y m−1 ; X m ) in this example.