Statistical Causality for Multivariate Nonlinear Time Series via Gaussian Process Models

The ability to test for statistical causality in linear and nonlinear contexts, in stationary or non-stationary settings, and to identify whether statistical causality influences trend of volatility forms a particularly important class of problems to explore in multi-modal and multivariate processes. In this paper, we develop novel testing frameworks for statistical causality in general classes of multivariate nonlinear time series models. Our framework accommodates flexible features where causality may be present in either: trend, volatility or both structural components of the general multivariate Markov processes under study. In addition, we accommodate the added possibilities of flexible structural features such as long memory and persistence in the multivariate processes when applying our semi-parametric approach to causality detection. We design a calibration procedure and formal testing procedure to detect these relationships through classes of Gaussian process models. We provide a generic framework which can be applied to a wide range of problems, including partially observed generalised diffusions or general multivariate linear or nonlinear time series models. We demonstrate several illustrative examples of features that are easily testable under our framework to study the properties of the inference procedure developed including the power of the test, sensitivity and robustness. We then illustrate our method on an interesting real data example from commodity modelling.


Introduction
There are multiple notions of causality present in the statistics, econometrics and machine learning literature. We will consider one of these which is widely known as the class of causal concepts termed "statistical causality". We therefore, do not enter into any additional debate about merits of or frameworks for other notions of causality that may be common in areas of structured learning. Quoting Wiener (1956) "For two simultaneously measured signals, if we can predict the first signal better by using the past information from the second one than by using the information without it, then we call the second signal causal to the first one." The general concept of statistical causality is based on comparing two predictive models, and this will be the case regardless of the type of predictive models used.
We seek to define a class of causality tests which is very general, and in principle agnostic to the class of underlying process that generates the time series being studied. We will achieve this via a class of semi-parametric models that we will utilise to model structural hypotheses regarding how causality may have manifested in the observed vector valued processes. To characterise testing of such relationships for the specific class of models we will develop, we will be able to explicitly evaluate a test statistics for a hypothesis test in which the asymptotic distribution is known in closed form, and under conditions discussed it can be shown to be the uniformly most powerful test. To achieve this we will introduce the class of representations we develop to characterise the observed vector valued time series according to Gaussian process models. These models are flexible in that they will efficiently allow us to test very general linear and nonlinear causality structures in the trend or volatility dynamics of the observed time series.
Throughout, we will consider without loss of generality, three multivariate time series denoted generically by t ∈ ℝ p , t ∈ ℝ p � and t ∈ ℝ̄p , which will be treated as column vectors. Our goal is to develop a framework in order to be able to assess conjectures regarding temporal relationships between these two multivariate processes t , t , in the presence of side information t , such that we can apply formal inference procedures to determine the strength of evidence for or against such conjectures. We aim to achieve this in as general a manner as possible in order to accommodate differing forms of these relationships such as linear and nonlinear as well as stationary or non-stationary relationships. The approach we adopt will not require specific assumptions on the mechanism or model that generated these two series, which is important to understand. We form a distinction between the models used to postulate and test for the presence or absence of relationships between processes in a statistical causal manner and the knowledge of the true model or data generating processes. We will propose a framework which is applicable to testing very flexible and general relationships of causality whilst still potentially being misspecified with respect to the true data generating mechanism. As such, we note that we do not seek to perfectly represent the true underlying data generating processes, we simply seek to determine plausible relationships and existence of different causality relationships. This can be seen as a more general result than trying to model exactly the true underlying processes of the time series, as our approach can be used for rapid screening of multiple hypotheses about causality prior to a more detailed model development procedure.
To facilitate such generality, we set up the testing framework based on a model of the time series causal relationships captured by Gaussian process. Note, here we are not assuming the data is necessarily truly generated by a Gaussian process, but rather that the relationships of causality may be adequately reflected by such processes. This therefore only assumes a smooth variation of the causal relationships between the partially observed time series represented by data { t } , { t } , { t } . One particularly advantageous feature of working with Gaussian process representations of the causal relationships that are conjectured to be expressed by the time series is that we are able to derive and efficiently calculate relevant test statistics to perform inference of relevance to detection of causality structures in both linear and nonlinear classes.

Perspectives on Causal Analysis
The concept of statistical causality central to this paper is only one of numerous concepts of causality that have been proposed. For centuries, causality was studied by philosophers, until the advancements in science generated a need to express this concept in mathematical terms. As a consequence, there turns out to be a wide array of possible mathematical ways to express the concepts inherent in causality. In this section we would like to pay special attention to the General Theory of Causation by Pearl, and explain how Granger's statistical causality and Pearl's theory of causation cater to different needs.
Granger made certain assumptions, that he has called axioms Granger (1980), which are still central to statistical causality: A1 Time ordering: states that the cause happens prior to the effect; A2 No redundant information: states that the cause contains unique information about the effect -it is not related via a deterministic function. A3 Consistency: says that the existence and direction of the causal relationship remain constant in time.
We note that Granger has also pointed out the contentious nature of this third axiom "[...] generally accepted, even though it is not necessarily true", which he saw as central to the applicability of the concept of causality.
However, it is important to understand that causal theory as developed by Pearl (2000Pearl ( , 2010 does not recognise these axioms. Instead Pearl postulated that causal analysis should allow inferring probabilities under static conditions, as well as how they change under dynamic conditions, by answering three types of questions: Q1 Policy evaluation: What is the effect of potential intervention? Q2 Probabilities of counterfactuals: Can an event be identified as responsible for another event? Q3 Mediation: Can causal effect be assessed as direct or indirect? Furthermore, Pearl clearly distinguished between associational and causal concepts: "An associational concept is any relationship that can be defined in terms of a joint distribution of observed variables, and a causal concept is any relationship that cannot be defined from the distribution alone" Pearl (2010). Following that criterion, causality in the sense of Granger -which here and in literature is referred to as either "statistical causality", or "Granger causality" -is an associational concept that is conditional and probabilistic in nature. According to Pearl, given adequately large sample and precise measurements, one can in principle test associational assumptions, but not causal assumptions, which require experimental control. The last one is of crucial importance to understanding the differences in the applicability of those two concepts. Statistical causality has been developed for, and is often used in studying time series, and in that context no experimental control is available in an observational context rather than a designed experimental context. We conclude this part by referring the reader to the literature that offers tools for reconciling statistical causality with Pearl's General Theory of Causation. Eichler (2001) proposed using Granger causality graphs, and Eichler and Didelez (2007) introduced interventions to the Granger causal framework in a way that is similar to Pearl's approach, and thus offer tools for reconciling statistical causality with Pearl's General Theory of Causation. White et al. (2011) demonstrated how Pearl's Causal Model and Granger causality are linked when expressed in terms of extension of Pearl's Causal Model with settable systems.

Contributions
In this section we briefly outline the novelty of the proposed statistical causality framework developed and contrast the contributions to related references Amblard et al. (2012a, b). In Amblard et al. (2012a) they propose the use of Gaussian Process (GP) models for univariate time series studies and use the data evidence obtained from the models to design a test for causality. However, these works do not explore the complete flexibility of these classes of models, where one can generalise readily to multivarite time series settings, also add side information, and they do not explore the range of causal structures that includes linear or nonlinear structures in the trend or covariance, as well as the estimation optimisation procedure, and model calibration aspects that we generalise in this manuscript. Furthermore, no sensitivity studies or analysis of the effect of model misspecifications is undertaken in the aforementioned works. Despite formulating test statistic as a difference in marginal loglikelihoods, Amblard et al. (2012a) do not exploit the properties of likelihood ratio type tests (LRT or GLRT). Therefore, in this manuscript we build upon these interesting papers and we genearlise the class of models significantly as they have not formulated the required nested structures that are achieved in this manuscript to facilitate more direct hypothesis testing frameworks through the LRT and GLRT, where we may take advantage of well known properties of the resulting test statistics asymptotic distribution under the null.
We would also like to point out that in our research we found the "kernelised Geweke's measure" of Amblard et al. (2012a) to be a nonlinear generalisation of Granger causality with many good properties, but with several shortcomings that we were able to successfully address by the use of GPs (notably: the difficulty of hyperparameter estimation, model calibration, lack of interpretability of model parameters). However the authors of Amblard et al. (2012a) have progressed from what we see as more general model (GP) to a less general model (kernelised ridge regression), which they saw as more practical for modelling instantaneous causality Amblard et al. (2012b). In our manuscript we do not address instantaneous causality (instantaneous coupling), but our framework can incorporate it, just like it can model causality between multivariate time series.
The multivariate causal testing framework developed in this manuscript allows one to incorporate aspects of causality, linear and nonlinear, in the mean and the covariance. In line with the very general definition of non-causality, models of statistical causality typically test for the equivalence of two conditional distributions. One can then differentiate approaches based on what further assumptions are made on the models. For instance, linear regression methods focus on recognising dependence in trend, under strict model assumptions, while nonlinear generalisations relax these model assumptions. These models do not, however, allow for causality in covariance, or any other nonlinear structures. The framework developed in this manuscript can accomodate these valuable extensions to allow direct straightforward testing of causality in the covariance in linear and nonlinear settings.
Secondly, analysing causal structure with Gaussian processes hasn't been done in the likelihood ratio framework, we suppose due to the complication in formulating a nested testing model structure. In this manuscript we propose a way to construct model nesting that allows for application of the likelihood ratio test (LRT) and Generalised Likelihood Ratio Test (GLRT). This model nesting is constructed to be applicable for assessing causality in the mean, or covariance, or both, and is achieved through Automatic Relevance Determination (ARD) construction of the kernel. The development of nested models is important, as the standard asymptotic distribution of the LRT test statistic under the null being 2 does not hold for non-nested hypotheses. Thus, we emphasise that the novelty does not lie in the development of the asymptotic behaviour of the test, which is standard, but in constructing a framework that allows to apply that test in this general and flexible statistical causality framework we propose. Furthermore, with our GP model formulations the test statistic can be written in a closed form, can be computed point-wise, and is efficient to compute.
There are numerous advantages of using GPs, beginning with: ease of optimisation and interpretability of hyperparameters, flexibility, richness of covariance functions, allowing for various model structures. Using a likelihood ratio type test with a GP is a very natural choice, as estimating GP model parameters is often done on the basis of maximising likelihood, and therefore this estimation can be incorporated into the compound version of the likelihood ratio test (Generalised Likelihood Ratio Test, GLRT). From Gaussian variables, GPs inherited the property of being fully specified by the mean and the covariance, and so testing for model equivalence inherently means testing for equivalence of the mean and covariance functions. But many popular kernels do not have the ARD property, and using them for a likelihood ratio test settings gives no easy way to account for causal structures in covariance. Consequently, it is using GLRT with an ARD-GP that gives a uniformly most powerful test with an unparalleled flexibility: known asymptotic distribution under the null, explicit evaluation and in a closed form, and usefulness also for misspecified models.
Thirdly, we demonstrate the ability to detect and identify causal structures in the mean and covariance, even in the presence of different types of model misspecifications. We undertake careful study of sensitivity and robustness of these testing frameworks to various features that one would encounter, like: sample size, parameter misspecification and structural misspecification. It is important as these studies demonstrate that one can reliably apply these tests in a general framework, even if the model is misspecified in those ways, and still have confidence that the inference procedure can detect these types of causality in mean and covariance incorporated in this framework reliably.

Concepts in Statistical Causality
It is important to understand the context of our proposed framework in light of the specific formulations that have been proposed before when studying the concept of statistical causality. We, therefore, briefly outline previous approaches to consider this testing framework for statistical causality and importantly the required assumptions.

Granger Causality
In this section, we present the original formulation of the hypothesis tests and test statistics for Granger causality, as well as a few of their later extended formulations that form the basis for considering concepts of statistical causality.
The first testable form of statistical causality proposed by Granger (1963) was developed in the context of linear forms of vector autoregressive models. For time series t , t and t lets assume we consider two alternative model formulations for t : with l ∈ ℕ being the maximum number of lagged observations and Y,t , ′ Y,t denoting noise (later E X t , E Y t will denote residuals). In this setting, we introduce Granger's definition: Definition 1 Granger (1963) Causality of the process Y by the process X is defined when: , and this is denoted as: X ↛ Y.
Typically the hypotheses will be given by one of the following sets of null hypothesis of Granger non-causality and alternative hypothesis of lack of Granger non-causality 1 . The version 1 of the hypothesis of non-causality is consistent with the Definition 1: In the specific case of the linear regression models from the Eqs. 1 and 2, we can also use the version 2 of the hypothesis of non-causality (which implies version 1): Granger proposed to test the null hypothesis from Eq. 3 using a test statistic called strength of causality and denoted L SC X→Y , defined as the ratio of the two variances of prediction errors: The strength of causality underlines the relationship between Granger causality and model specification tests for linear regression.
Since this instrumental work there have been numerous developments and extensions proposed. For instance Geweke (1982) proposed measure of linear feedback, with the same model assumptions and equivalent to strength of causality (Eq. 5), and defined as: (2) version 1 version 2 H 0 ∶ ∀j ∈ {1, .., l} A 21,j = 0, , where 0 ≤ L SC X→Y ≤ 1.
which will be 2 p distributed under the null hypothesis of lack of causality. Kernelised version of the Geweke's measure of linear feedback has been proposed in Amblard et al. (2012b) with the aim to make it a nonlinear method. This kernelised version of the measure of linear feedback has the same form as in the Eq. 6, but arises from a different model: kernel ridge regression, with the best predictor in the reproducing kernel Hilbert space (RKHS) generated by the associated kernel.
In the kernel ridge regression the solution is no longer represented in terms of optimised coefficients A ⋅⋅,j from Eqs. 1 and 2, the so called primal solution which we will denote as . Instead, the dual solution krr are such coefficients, that allow the solution to be represented in terms of inner product of the covariates (independent variables). Below, we introduce notation that will allow convenient matrix operations, and that will be used throughout also in the later sections: For Model A, we have ℚ t ∶= t , t , ℚ ∶= −l , −m , and if these two need to be distinguished, we will add subscript referring to the model: ℚ B,t , ℚ A,t . Here the matrix ℚ represents all available covariate data, and the matrix ℚℚ T ∶ T × T is a Gramm matrix of the covariates data -it is in such a form that admits application of the kernel trick by "substituting" it with a Gramm matrix which we will denote ℚ . The Gramm matrix ℚ , also called kernel matrix or covariance matrix, can be defined element-wise as Mercer kernel function evaluations: ℚ i,j = k( i−l∶i−1 , j−l∶j−1 ) . The Mercer kernel function k(⋅, ⋅) ∈ M(X) is a real, symmetric and semi-positive definite kernel function, defined on the domain X × X . Then, the optimal weights, fitted values and mean square of prediction error will for kernel ridge regression be as follows: When kernel ridge regression is applied to model A, or model B, all of the steps above are applied, but with different definition of Q t , and therefore different values of the covariance matrix ℚ . Denoting the fitted values as ̂ A 1∶T and ̂ B 1∶T , we obtain the mean square errors of kernel ridge regression prediction of the two models: Var ̂ A 1∶T − 1∶T and Var ̂ B 1∶T − 1∶T , which are used in the test statistic in a similar manner to the strength of causality from Eq. 5, and to the test statistic from Eq. 6. Thus optimal weights: the test statistic based on the kernelised ridge regression, that Amblard et al. (2012b) proposed is formulated as follows: The hypotheses are: See Amblard et al. (2012b); Zaremba and Aste (2014). We also refer to Lungarella et al. (2007) for other generalisations of Granger causality.

Transfer entropy
A third set of hypothesis has been subsequently introduced, which also relies on concepts of conditional independence, see Granger (1980);Eichler (2001); Amblard et al. (2012b): The hypotheses in Eq. 8 were a starting point for a wide range of other tests, many of which would no longer assume the linear form of the models in the Eqs. 1 and 2, see Schreiber (2000); Lungarella et al. (2007); Chen (2006). One of the more important papers here is the one by Schreiber (2000) who introduced the information theoretic approach to modelling causality by proposing transfer entropy, which is now one of the most popular nonlinear statistical causality measures. Transfer entropy is defined as a difference of two conditional entropies: where is the differential (continuous) entropy and is the conditional entropy.
The asymptotic properties of transfer entropy are analysed less often. However Barnett and Bossomaier (2012) prove that for ergodic processes, the transfer entropy is a loglikelihood ratio, asymptotically distributed according to the 2 distribution under the null hypothesis of lack of causality, and having asymptotic non-central 2 distribution for the alternative hypothesis. The rate of convergence of the test statistics asymptotic distribution can however be problematic in practice, requiring very large sample sizes for valid application of the test according to the asymptotic distribution (please refer to the experimental results, Sect. 5.6). In general, the null hypothesis from Eq. 8 is not equivalent to neither that from Eq. 3 nor from Eq. 4. For the linear model from Eqs. 1 and 2, and with the assumptions of Y,t , ′ Y,t having the same distributions, the null hypothesis from Eq. 4 implies the null hypothesis from Eq. 8. Furthermore, under the normality assumptions, the test statistic L TE X→Y is equivalent to both L SC X→Y and L MLF X→Y , see Barnett and Bossomaier (2012).

Semi-Parametric Nonlinear Causal Process Representations
We begin by defining Gaussian Processes, as this will serve as our base class of stochastic processes that we adopt to characterise different examples of causality model structures. The vector valued time series t is described by a Gaussian Process model, which is denoted as GP and defined as follows: Definition 2 (Gaussian Process (GP)) Denote by f ( ) ∶ X ↦ ℝ a stochastic process parametrised by { } ∈ X , where X ⊆ ℝ p . Then, the random function f ( ) is a Gaussian process if all its finite dimensional distributions are Gaussian, where for any N ∈ ℕ , the random vector f 1 , f 2 , … , f N is jointly normally distributed, see Rasmussen and Williams (2006).
We can therefore interpret a GP as formally defined by the following class of random functions: At each point the mean of the function is (⋅; ), parametrised by , and the dependence between any two points is given by the covariance function, also called Mercer kernel: k ⋅, ⋅; k ∶ M(X) , parametrised by k , see detailed discussion in Rasmussen and Williams (2006). We will later use notation = ∪ k , and will refer to as hyperparameters of the GP random function f.
We then model the time series t causal relationships as realisations 2 from a GP f (⋅) with additive Gaussian noise t . with the following generic definition of the mean function t ∶ ℝ kp+lp � +mp → ℝ and the covariance function k t,s ∶ ℝ kp+lp � +mp × ℝ kp+lp � +mp → ℝ: It will be useful to make the following notational definitions for the mean vector, and correlation matrix, respectively: and SPD T is the manifold of symmetric positive definite matrices of size T × T.

Covariance Functions and Automatic Relevance Determination for Causality
As is standard in GP modelling, we will represent the covariance functions with functions that are known as kernels, and we will focus on the class of Mercer kernels M(X).
Definition 3 (Semi-positive definite kernel) A function k ∶ X × X → ℝ is called a semipositive definite kernel kernel (positive definite) if and only if it is symmetric, that is, ∀ , � ∈ X, k( , � ) = k( � , ) and semi-positive definite, that is There are several important properties of kernels, see Scholkopf and Smola (2001). A centered GP is uniquely determined by its covariance function (semi-positive definite kernel). Conversely, any semi-positive definite kernel determines a covariance function and a unique centered GP, see Hein and Bousquet (2004). Moreover, there exists a bijection between the set of all real-valued semi-positive kernels on some space X and the set of all centered GPs defined on X . Kernels can also be seen as inner products, see Schoelkopf et al. (2004).
An important concept that will be broadly used in the context of kernel classes is the concept of Automatic Relevance Determination (ARD). It has been initially introduced by MacKay (1994), as a Bayesian model where input relevance can be introduced and controlled with parameters; see also Neal (1996). This has later become popular in a wider context of feature selection and sparse learning in Bayesian models, see Qi et al. (2004). We use the same concept, but for a purpose of ensuring we have nested models for inference hypothesis design (see Sect. 4.1.3), and it will be crucial when applying the Generalised Likelihood Ratio Test.
In the ARD model, each input variable has an associated hyperparameter whose value can scale the effect of that input. In the Bayesian approach, this is achieved by setting a separate Gaussian prior for each of the inputs. In our (frequentist) case we treat each dimension as a separate input and define our mean and covariance functions in such a manner that the effect of each of the univariate inputs can be separately changed through zeroing of the hyperparameter associated with the given marginal input component. In particular, by setting specific values of the hyperparameters we can practically eliminate some of the univariate variables from the mean/covariance. This construction has several important advantages: it allows for marginal causality testing as well as developing a class of nested model structures, critical to determining the statistical significance of causality relationships under consideration. In the table below (Table 1) are two examples of popular kernels and their ARD versions. Rasmussen and Williams in their MATLAB toolbox provide an ARD version of the squared exponential kernel with diag l −2 1 , ..., l −2 n , our version from the Table 1 allows to choose l i = 0 which removes the effect of the i-th dimension of input on the kernel. As a result, the covariance for lower dimensional space can be expressed as a covariance with a higher dimensional space . Given a set of input points i |i = 1, ..., n we can compute the Gram (covariance) matrix whose entries are K ij = k( i , j ).

Characterising Causality Hypotheses With Gaussian Process Models
When performing inferential tests for statistical causality one will typically compare two alternative model hypotheses. We have already seen in the Sect. 1, that such hypotheses can be formulated in multiple ways, see Eqs. 3, 4 and 8. In defining the non-causality tests, we start from the more general forms of the hypotheses outlined in Eq. 8. Table 1 Summary of several popular kernel functions. We are using the following notation: p 1 ≤ p is the dimension of vectors u , v , and T u,[1∶p 1 ] = x u,1 , ..., x u,p 1 , A is a constant positive definite matrix, a, c are a constants, l is a lengthscale parameter, and l 1 , ..., l p is a vector of lengthscale parameters, d = || u − v || represents a distance, e.g. an Euclidean distance, = ||x u,1 − x v,1 ||, ..., ||x u,p − x v,p || , and k 1 (⋅), k 2 (⋅) are stationary kernels covariance function The two causal model structures are generically represented as multi-dimensional Gaussian process time series models observed in additive Gaussian noise and denoted by Model A and Model B in the Eqs. 9 and 10 respectively: with the following forms of mean functions We assume the mean and covariance functions, A , k A and respectively B , k B , have similar functional forms and only differ in dimensionality and hyperparameters.
Having defined these two models we may now state the form of the hypotheses for testing for non-causality (lack of causality) in nonlinear times series. The test that allows comparing two models from the Eqs. 9 and 10 is fundamentally a test comparing two distributions -the conditional distribution of the time series { t } conditioned on inputs from either of the two models. As it was already mentioned, we never actually confirm the statistical causality, but rather reject lack of causality (test for non-causality).
Under such a test, the null hypothesis is that there is no causal relationship from time series { t } to { t } , and including the past of { t } does not improve the prediction of { t } . Given the model formulations, this means equality of conditional distribution of , conditioning on either set of explanatory variables (analogously to Eq. 8): The distributions above can be obtained in closed form only in the case of additive Gaussian noise, or in cases where there is no assumed additive noise in Model A or model B.
Since a GP is also specified by its sufficient mean and covariance functions, testing for equality of distributions will be equivalent to testing for equality of the mean functions and the covariance functions. Hence, the convenient feature of the causality testing framework developed from the GP framework we propose is that these general distributional statements about population quantities in the null and alternative hypotheses are equivalent to the following population statements on mean and covariance functions.
where M represents the class of all Mercer kernels. If the classes of mean and covariance functions are restricted so that the Model A is nested in the Model B (defined in the Subsect. 4.1.3), then the above hypotheses can be tested with the Generalised Likelihood Ratio Test.

Generalised Likelihood Ratio Test
The GLRT is a composite hypothesis test that can be used in the case of nested hypothesis if the parameters are unknown and need to be estimated. Below we describe the test, using notation from Garthwaite et al. (2002). The GLRT gives us asymptotic distribution of the test statistics, but it requires that the hypotheses are nested -which can be expressed in terms of restriction on mean and covariance formulations.

Theory for Generalised Likelihood Ratio Test
Let 1 , 2 , ..., N be a random sample of size N from a distribution with pdf p( ; ) , and suppose that we wish to test: H 0 ∶ ∈ vs H 1 ∶ ∈ Ω − . Then define a random variable: where L( ; ) = p( ; ) is the likelihood function. For some constant A, we can use a test with critical region Λ ≤ A.
If we define q as the difference in dimensionality of H 0 and H 0 ∪ H 1 , then we have that under the null, the asymptotic distribution of the test statistic is distributed according to: We would like to emphasise that the GLRT test compares the likelihoods of parameters either belonging to the whole parameter space Ω , or to its subset ∈ Ω (Eq. 13). This nesting of parameter spaces will be the basis for defining nested hypotheses in Definition (4).

Generalised Likelihood Ratio Test for Testing Causality
Let us refer to the null hypothesis of non-causality as it was formed in the Eq. 11. The likelihood ratio test can be rewritten in terms of a difference of two marginal log-likelihoods , and it leads to the definition of a causality test statistic L X→Y|Z , first proposed by Amblard et al. (2012a): In this paper we assume additive Gaussian errors, which allows us to calculate the marginal likelihoods analytically. For the calculations please refer to the Appendix 1. The resulting distributions are: If we use the hat notation for MLE estimators of the hyperparameters of the mean and covariance functions, then the test statistic is given by: In the Eq. 15 we present a general form of the test statistic for multivariate time series, and in the special case of a univariate time series this simplifies to a form from the Eq. 16. Distinguishing between the two definitions can also be seen as a distinction between joint causality and marginal causality.
Under certain regularity conditions, with the assumptions of conditional independence of t | −k t−1 , −l t−1 , −m t−1 for all t, and with the assumption that models A and B are nested (see 4.1.3) we can treat L X→Y|Z as a GLRT and use the asymptotic results: where q is the difference in dimensionality between the parameter space for A and B .

Nested Models
An essential concept in our testing procedures is that of nested models. Its importance arises from the fact that the Generalised Likelihood Ratio Test (GLRT) on nested hypotheses has known asymptotic distribution. (14) Definition 4 Nested models. Two models: M A parametrised by A and M B parametrised by B are said to be nested if it is possible to derive one from another by means of parametric restriction, see Clarke (2000) Intuitively, we could say that model A is nested in model B if the input space of model A is embedded in input space of model B, but the Definition 4 is formulated in terms of embedding of the model parameter spaces, rather than embedding of the input spaces. Formulating our Gaussian Process models A and B in such a way that they are nested according to the above definition is not always possible. This is because for the above definition of nested models we require the mean and covariance function to have parameters that correspond to the dimensionality of the input space, or that correspond to the inclusion or not of the input X.
In practice, when we talk about nested models we consider mean and kernel functions allowing the nested model representation. The simplest example of how the mean and kernel functions can allow nested models are for linear mean and kernel functions. Define Analogously, for the linear kernel: A popular kernel function that does not allow nested models is squared exponential kernel: which, however, can be extended to a representation under an ARD structure, which does have a form that allows for nested models (see Subsect. 3.1 and the Table 1), if the division by a scalar lengthscale parameter 2l 2 is replaced by a multiplication by the following matrix of lengthscale parameters: diag l 2 X , l 2 Y , l 2 Z . If the nested model representation is not practical, then GLRT test should not be used. There are several approaches for non-nested models: modified (centered) loglikelihood ratio procedure -Cox procedure, "comprehensive model approach", "encompassing procedure", Vuong closeness test: likelihood-ratio-based test for model selection using the Kullback-Leibler information criterion. We refer the reader to the following papers (and references therein): Vuong (1989); MacKinnon (1983); Pesaran and Weeks (2001);and Wilson (2015).

Synthetic Data Experiments to Assess Proposed Causality Testing Framework
In this section, we seek to study the behaviour of our proposed methodology for GP testing of statistical causality relationships. In order to motivate the causality studies in this paper, we consider three illustrative nonlinear time series models. They will serve as references that we will apply our causality testing framework to, throughout the synthetic studies undertaken in the results analysis for testing power, sensitivity, and robustness of our proposed causality testing framework.
In particular the classes of model we have chosen as illustrations of data generating processes for the time series that will form inputs to our testing framework characterise a range of general model structures which allow for assessment of linear and nonlinear causality structures in the trend or the volatility or both components of the resulting data generating models. The examples that we will use will assume q = 2 , which means that in the mean this time series will have a nonlinear causality in the direction Y → Z , aside from the linear causality X → Y.

Example Time Series
We will express the model from the Eq. 17 in the form of three GPs, as in the Eq. 18. When generating the data, as Eq. 20 show, we will use Matern covariance functions with degrees of freedom = 1.5 , we will also extend the model to allow causal relationship in covariance -relationships, that were not existing in the time series formulations from Eq. 17.
A formulation of the time series from the Eq. 17 explicitly as GPs can be done according to the following conditional distributions: where the mean functions are linear: and covariance functions incorporate the noise which was already defined as a GP: nonlinear causality Note that the main causality structure has been encoded in the mean functions, but the way the covariance functions are formulated allows some causality in the covariance in the directions X → Y and Y → Z.

Example Time Series Model Class 2: Structural Causality Incorporated in Volatility
The second causality structure has similar autoregressive and causal components to the Structure 1, but the error terms depend on past values of the other time series (so no autoregression in the covariance) via nonlinear functions f y , f z : where The formulation above is general and the noise terms y , z can depend explicitly on time via the functions g y (t) and g z (t) . We use c y , c z , d y , d z , p, q to denote constants. For this time series to be expressed in terms of GP we will have exactly the same general GP structure as for the time series 1 in the Eq. 18, and exactly the same mean functions -the Eq. 19. To construct the kernels that will match the covariance structure, we use the properties that summations and multiplications of kernels yield new kernels, for example as follows:

Example Time Series Model Class 3: Causality Features in Presence of Long Memory
The third data structure is a long memory process: ARFIMA(p,d,q), for d ∈ [0, 0.5) , with causality structure encoded in the form of external regressors: where B is a backshift operator, the autoregressive coefficients for the time series Y t , Z t include external regressors, the moving average coefficient according to characteristic polynomial: Θ(B) = 1 − 1 B − ... − q B q , and the long memory operator has linear process series expansion given for d ∈ (0, 0.5) as follows: In this example, there is no natural way to trivially develop a GP representation, however, it does not preclude fitting a misspecified model in order to screen for causality structures that may be present. We can fit such a model to partial observations of this reference example. This poses an interesting example to study the effect of model misspecification on the ability to detect linear and nonlinear causality structures.

Synthetic Data Experiments
In this section, we provide results for a series of tests of performance focusing on three key attributes of the proposed causality inference framework: power, sensitivity to parameters and robustness to model misspecification or parameter estimation errors. We perform these analyses for each of the three case study models introduced. We begin with sensitivity and misspecification tests, which we follow with experiments on the power of the test for simple and compound tests.
The sensitivity analysis shows how the test reacts to varying the parameter values used to generate the time series data in Example model 1, Eqs. 18-20. Here, we know the exact model so that a simple test is performed, where we assess its power over the parameter space.
The model misspecification tests show how the test reacts to discrepancy between the parameter values used to generate the time series data and the parameters used in the test statistic. This is a structured form of compound test analysis, since in practical settings in general the parameters will be estimated from data and then used in a compound testing procedure, in which the test statistics is a function of the estimated parameters.
We begin with two simple illustrative examples showing how the values of the test statistics from Eq. 14 change for different data samples, and what values of the 2 cdf they obtain. Throughout, we will perform analysis relative to the level of significance for the test of 10%. The Fig. 1 illustrates a compound test with optimised parameters -showing the values of test statistics L X→Y vs L Y→X and the 1-p values, or the evaluations of the distribution 2 2 (2L X→Y ) vs 2 2 (2L Y→X ) . The data has been generated from causality structure 1 with strong causal effect X → Y , with each of the 50 data replicate time series samples being of length 500 sample points.
The interpretation of the Fig. 1 is the following. From the left plot we can see that the test statistics L X→Y has values which are separated from and considerably larger than the test statistics L Y→X . This by itself is an indication that the causal effect X → Y should be stronger than Y → X . From the plot of cdf evaluations we observe that all of the values of L X→Y are in the tail (with cdf values of exactly 1) and therefore the null hypothesis is strongly rejected at any confidence level, for each of the trials. This means that the estimator of the power of the test, i.e. the probability of rejecting the null hypothesis if it is not true, is very close to 1 for a very large range of confidence levels, certainly between 0.01% and 10% This indicates that, as expected, the test performs very well in detecting the correct direction of causality -in this case Y → X.

Model Sensitivity Analysis
It is important to ensure that, on one hand, the tests behave in a stable way when the parameters change -at least in some non-extreme region -and, on the other hand, that the tests are not heavily penalising misspecifications.
This test is performed for the first data structure, Eqs. 18-20. We use the following settings: Matern kernel, additive noise with variance of 2 n = 0.01 , grid of 21 different parameter values for each variation of the true model parameters assessed. For each experiment we consider 100 trials and the length of the simulated time series varies over range 20, 50, 100, 200, 500, 1000. We report rejection or lack of rejection of the test with the significance of = 0.1 . The starting point is the parameter set: a X = a Y = a Z = 0.3 and b Y = b Z = 0.7 (parameters of, respectively, autoregression and causality in the mean, as per Eq. 19), l a = l b = e −1 , f = e −3 , n = 0.1 (covariance parameters: autoregression, causality, multiplicative scaling, noise covariance, Eq. 20). Parameters are changed one at a time, and a new set of data is generated for each set of parameters.
We do not report results of the sensitivity test for the directions without causality: Y → X or Z → X , as the test statistics in those cases will always be zero. When changing parameters in both models at the same time, we no longer use the true parameters, but we still compare models that are equivalent.
In the direction with causality X → Y we see that the behaviour of the test is very stable, with the changes in the frequency of rejection/non-rejection (here presented as estimated power of the test) influenced mostly by the sample size. The power of the test is the probability P(H 0 rejected|H 1 true) , which in our case is estimated as 0.01 ⋅ ∑ 100 i F(2L X i →Y i ) , where we have 100 trials, F denotes the cdf of 2 2 and 0.9 is 1 -confidence level. When compared to the X → Y direction, the results for Y → Z are less uniform, as shown in Table 2. The Table 2 demonstrates the power of the test for minimum and maximum of the parameter range, which is enough to portray the behaviour of the test for all parameters except f for the Y → Z direction, for which local minimum can be seen in the Fig. 2. Based on the Table 2, and corresponding Fig. 2, we can also observe that the results for Y → Z are more sensitive to the change in parameters than the results for X → Y , in particular the causal coefficient b Z .

Model Misspecification Analysis
For the misclassification test we have chosen different starting settings for the covariance function l a = l b = e −3 , f = e 1 , which result in higher covariance, and much more pronounced effects of misclassification of covariance function parameters. Starting from the base set of parameters we alter one parameter at a time when calculating the test statistic; however, we use data generated for the base parameters: so that altered parameter is misspecified. It has to be emphasised that in the misspecification test a parameter will be altered for model A or model B, but not both.
Results of misclassification in the mean, which we do not report, are straightforward to understand and interpret. The power of the test depends mostly on the size of the Table 2 How power of the test changes with length of the time series (n) and changes of single parameters. Default parameters: a X = a Y = a Z = 0.3, b Y = b Z = 0.7, q = 2, l a = l b = e −1 , f = e −3 , n = 0.1. , one of the mean or covariance parameters changes ±50% in simulation and model as well. We look at time series of length n = 20, 50, 100, 200, 500, 1000 . The parameter values correspond to the values in Fig. 2 Throughout these tests the 50% change in the parameters relates to the model parameters; The actual decrease/increase for covariance parameters is much bigger than for the mean, because the former is inputed to the algorithm as a logarithm .00 n = 100 0.92, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 0.96 1.00, 0.70 0.91, 1.00 n = 200 0.99, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 0.87 0.99, 1.00 n = 500 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 0.99 1.00, 1.00 n = 1000 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 1.00, 1.00 sample and, to a smaller degree, on the deviation from the true mean. For the direction where causality exists, the power of the test changes almost uniformly with the misclassification of the mean parameter. This is in line with observations that we will see repeatedly -that the power of the test is more robust to any parameter changes in the presence of causality in the mean.
Results of misclassification in the covariance, Figs. 3 and 4, are not so straightforward to understand and interpret. In particular, the performance of the tests seems to be more sensitive to the misclassification of the strength of the observation noise -this is not observed when parameters of the covariance (mainly f ) are smaller.
Heatmaps show power of the test (hypothesis of no-causality rejected for cdf above 0.9) for different lengths of the time series and for one of the mean or covariance parameters changing + − 50% in simulation and model as well Fig. 3 Power of the test of the hypothesis of non-causality in the direction X → Y changes with the sample size and misspecification of a single hyperparameter (here -covariance parameters)

Summary of the Section: Analysing power of the test (1-rate of type II error) is a popular
technique of assessing the quality of a test or a testing procedure. It is expected that the power of the test will increase with increasing sample size, and showing that this is indeed the case for our testing procedure will be the focus of this and the following sections. We start by analysing the results of simple tests, where exact parameters are used, and there is no effect of parameter misspecification. Strictly speaking, the simple test can be performed only for the first two data structures, as the third has been defined as an econometric model  Figure 5 shows evolution of receiver operating characteristic (ROC) curves with increasing sample length, for two sets of parameters. When performing simple test, for most of the parameters, the ROC curves will show that positives and negatives are almost always properly classified, even for short time series -as seen in the example in the left chart of Fig. 5. This example represents testing of model 1 with true parameters: a X = 0.3, b Y = 0.7 for mean function, and l a = e −3 , l b = e −1 , 2 f = e −10 as the kernel parameters. The corresponding distributions of 1-power of the test can be seen on the left chart of the Fig. 6, in the form of boxplots. These distributions have medians at 1 for samples of length from 50 up, and no outliers for samples of length 500 and 1000.
The notable exceptions observed are as detailed below. Firstly, we show an example for which a higher rate of misclassification is seen, albeit it still decreases with the size of the sample. The right chart in the Fig. 5, has larger value of 2 f = e −2 ≃ 0.1353 , but with other mean and kernel hyperparameters remaining the same. The middle chart of the Fig. 6 shows that in this case, even for the sample of length 500 we still can observe some outliers with 1-power of the test at 0.
The right chart of the Fig. 6 shows an extreme case, where the power of the test degrades with length of the time series to a random coin flip on the hypothesis, although it improves if we consider exceptionally long samples of 5000 data points. We can see that the medians of the distributions of 1-power of the test drops from 1 to 0 for samples of length 100-1000, and gets back to 1 for sample of length 5000. The kernel hyperparameters in this case are equal: l a = e −1 , l b = e −3 , 2 f = e −2 . This means signal variance at the same level as the less extreme case, but bigger autoregressive hyperparameter and smaller causal hyperparameter in the covariance function. Those parameter values, where increasing the sample size temporarily causes decrease of power of the test, can correspond to the dark areas from the Figs. 3 and 4. The right plot show an extreme case of performance decreasing with sample size for the typical range of sizes, hence the addition of results for data f length 5000 1 3 The parameters that cause such behaviour is primarily the signal variance 2 f , and to a smaller extent l a -the coefficient of autoregression in covariance function. The hyperparameter 2 f increases the value of the covariance proportionately, while l a -inversely and less than proportionately. Higher values of the covariance function mean higher volatility clustering, an effect which could compete with causality, but that could be less visible in short time series. We will not elaborate on this point here, but additional dependence structure can complicate the explanation of causality structure. Therefore longer time series appears necessary to correctly recognise causality in this case. The Fig. 7 shows the effect of length of a time series on the value of the test statistics L X→Y for a particular combination of parameters. A single data set of length 5000 has been simulated and subsequently tests statistics have been calculated on the first 100, 200, 300, ...5000 data points. The chosen data set has a general trend of test statistics increasing for longer data lengths (as for all other data sets generated with the same parameters) but it shows to major dips of test statistics temporarily worsening.
The causal effect in the covariance function is difficult to observe. This is because on one hand, it seems to have a much subtler effect than the causality in mean, but also because it is entwined with other effects that can be observed for different parameter combinations. Figure 8 shows that for following parameters b Y = 0, a Y = 0, l a = e, l b = e, 2 f = e 4 the causality in covariance is unambiguously observed already for sample size of 50. As a reminder, according to the Eq. 19, b Y = 0 means no causality in the mean and a Y = 0 means no autoregression in the mean.

Example Time Series Model Structure 2
The results for testing of model 2, Eqs. 21-23, are just commented on here, since in the simple testing framework they do not show anything unexpected. In particular, the power of the test does increase with increasing length of the time series. Arguably, there is much less opportunity for problematic behaviour. This is firstly because the range of parameters which are available for the Example structure 2 is much narrower than for the Example structure 1 (i.e. parameters for which the series does not explode to infinity). Secondly, we assumed cov( Y t , Y t � ) = 0 , but if we did not we could have had again the problem with volatility clustering masquerading as causality.

Example Time Series Model Structure 3
We do, however, report a few observations on the testing of model 3. Firstly, model 3 does not have a GP representation, so when reporting on the results of the "simple test" in this case we do not perform a test with "true" parameters, but a test with fixed, rather than optimised, parameters. These observations become particularly interesting when compared with the results of the compound test for the data generated from the model 3. The main property of interest in the model 3 is the long memory, and this is what we concentrate on here. When analysing results for the data generated from model 3 (simple or compound test), on one hand, we expect that existence of the long memory will make recognition of causality more difficult, but on the other hand, we would like to see that causality can still be reasonably detected. Figure 9 shows how the power of the test is affected by increasing the long memory (values of the parameter d = 0.1 vs d = 0.45 ), and how this effect can be increased by changing other parameters (the degree of moving average from MA(1) to MA(4), noise covariance from 2 = 0.1 to 2 = 10 , strength of linear causality from b Y = 0.7 to b Y = 0.2 ). It is worth emphasizing that decreasing strength of causality has the biggest influence, and is the only factor that affects the power of the test for long time series (length = 1000).

Power of the Hypothesis Tests: Compound Tests
Summary of the Section Compound tests are two stage tests where both the likelihood as well as the model parameters are estimated. Robust estimation of parameters while possibly costly, is one of the most important pillars of robust testing with compound tests. In this section we want to draw attention of the reader to a few important phenomena: firstly, that the framework is much better Fig. 9 The effect of the rate of decay of autocorrelation on the power of the test in model 3 varies strongly with different parameters in picking up causality than accepting the lack of causality; and secondly, that even with strong model misspecification -which we will see for the model 3 -it is possible to identify causality.
One of the biggest factors influencing quality of the compound test is the efficiency of the optimisation algorithm. The objective function obtained from maximisation of the likelihood for parameter estimation produces generally a non-convex optimisation problem, which means that existence of local optima is likely. Using multiple starting points is highly recommended, but can potentially make the calculations very time consuming (our implementation involves a random grid of starting points). Using GPs with the assumptions we made in this paper (mainly: additive Gaussian noise) offers the advantage of being able to calculate the likelihood analytically. However, it is still possible that the data set can be so large, that this calculation will be prohibitively expensive. A popular approach in the literature is to decrease the dimensionality of the input data, see Snelson and Ghahramani (2007), or strive for efficient implementation, see Rasmussen and Williams (2006). An interesting and little known approach is to choose covariance function that promotes sparsity of the covariance matrix, as proposed by Melkumyan and Ramos (2009). Ensuring an approach is applicable to time series potentially adds a level of complication.
Example Time Series Model Structure 1 An observation that arguably holds for all datanot only the Model Structure 1 -is that when causality does exist in the data, the distribution of the test statistics estimator is much narrower than when there is no causality. An example is shown in the Fig. 10: the first plot shows that the causal signal can be picked up even for the shortest data, and the distribution of the tests statistics converges to value 1 already for length 100. When causality is not present (subplots 2 to 4) even for the longest used samples the distributions of test statistics are wide with median at zero, but 75th percentile often reaching close to 1.

Example Time Series Model Structure 2
The results of testing of model 2 show some very interesting behaviours. When fitting the model, we introduced model misspecification, because we allowed the structures to be the same for both directions. The first misspecification is in using polynomial means of second degree for Y → Z | X as well as Z → Y | X . The second misspecification is in using the same volatility structure for both X → Y | Z and Y → X | Z . As a result the estimated parameters in mean are often correctly estimated to be near zero, but the parameters in variance are strongly misspecified. The results still have reasonable power Fig. 10 Boxplots showing how the sample size affects distributions of the test statistics, in the case of existing causal effect (first subplot X → Y and b Y = 0.7 ) and in the case where causal effect disappears due to causal coefficient equal to zero (second subplot X → Y and b Y = 0 ), construction (third subplot Y → X ) or both (fourth subplot) of the test: the existence of causality is always correctly identified, however, in some cases the results could be interpreted as spurious causality. Also, like with model 1, there are cases where we seem to be spotting the causal effect in the covariance function when there is no causality in the mean, shown in the Fig. 11.
At the same time, we see that spurious causality signals are detected for the opposite direction: Y → X | Z . Figure 12 shows how in the presence of causality X → Y | Z ( b Y = 0.7 ), the opposite direction also starts displaying causality with growing sample size. Explaining spurious causality is often complicated. In this case, we want to emphasise the following observations. First of all, the value of the test statistics is much bigger for the side where true causality exists, and a much smaller sample is needed to start indicating that causality with high confidence. Secondly, we run a misspecified model for the Y → X | Z direction (the misspecification is in the covariance function, with the multiplicative parameter f having to equal zero to achieve properly specified function consisting of the multiplicative noise only), and even with multiple starting points, the optimised parameters are not as close to the true parameters as would be desired. Fig. 11 Model 2, X → Y | Z . Changes in recognition of causality with increasing sample size: different parameter settings. The top row shows the parameter settings where causal effect in covariance can be expected ( c Y ≠ 0 ), while the bottom row shows cases where causality in covariance is not expected ( c Y = 0 ). In all the cases there was no causality in the mean ( b Y = 0) Fig. 12 Model 2, Y → X | Z . Changes in recognition of causality with increasing sample size: different parameter settings. No true causality in the direction Y → X | Z , but there was causality in the opposite direction ( b Y = 0.7 ). The parameter that affects recognition of spurious causality is the additive parameter g, whose higher absolute values tend to increase covariance

Example Time Series Model Structure 3
The results for the third data set exhibit a similar trend in the aspect that when a strong causal signal is present, it is correctly recognised. In case of lack of causality, or with very weak causal component, the distribution of the test statistics can be wide, but no spurious causality is detected. The data generated from model 3 has a long memory component, controlled by the parameter d ∈ [0, 0.5) , and one of the most interesting aspects is understanding the effect of long memory.
First of all, with the standard parameters, long memory hardly influences recognition of causality. Here, standard parameters are: strong causal component present ( b Y = 0.7, b Z = 0.7 ), and the noise variance is not substantial ( 2 n = 0.01). Figure 13 shows the distribution of test statistics when long memory is not present ( d = 0 ), and when the effect of long memory is strong ( d = 0.45 ), for different data lengths. The effect of changing parameters on the data generated from the model 3, in particular of changing the memory parameter d, is not significant. This seems unexpected at first, compared to the results of the simple test.
The explanation, however, lies in how the parameter estimation works, illustrated in the Fig. 14. The model is strongly misspecified and several properties of the data Y and for different experiments, all of length 1000. It can be seen that the estimates strongly increase with increasing d and MA, and that this pattern appears for all values of the noise variance are not well described by the model. Let us remember, though, that the long memory component has an infinite sum moving average representation, and the moving average model has an autoregressive representation. So the primary effect of increasing moving average part and the long memory part is the increase of parameters responsible for autoregression.

Comparison to Other Models
This section is provided to substantiate some of the claims we make about how our methods compare to existing methods. We provide three case studies, two of them compare our method to benchmark methods for causality: Granger causality and transfer entropy. The third case study compares our method to using generalised likelihood ratio test on a well specified econometric model (ARFIMA, example time series model class 3, Eqs. 24-26). What we show in our experiments is that our model achieves good results for all types of data, but in all cases, except for applying linear Granger causality test to linear causality, our method has superior asymptotic properties, as it reaches good power of the test for small samples.
Please note that in these case studies we concentrate on the ability to detect causality, and not on the time complexity of the algorithm.

Case Study 1: Granger Causality
Granger causality can be seen as the original, but also the simplest method of assessing statistical causality. For Gaussian noise and linear causal relationship, Granger causality is arguably the best method, given that the test statistics have known asymptotic distributions, and estimators have excellent numerical properties. What is more, Granger causality can perform well for a range of data that departs from the model assumptions.
In this, and in the next case study, we will use four data sets, designed to show the effect of the departure from the assumption of data with linear dependence, stationary distributions, and Gaussian noise (as introduced earlier in the Eq. 17), replicated below with slight modifications: The data model from Eq. 27 exhibits two causal relationships. The causal relationship X → Y is -if we assume Gaussian white noise -of the type that Granger causality has been designed to model: linear, stationary, with Gaussian distributions. We will call this a base case (set one), and we will consider three other cases, each presenting a departure from one of those three properties. The causal relationship Y → Z is not linear, and it forms the set 2. We will also consider what happens to the ability to detect relationship X → Y , if we changed Gaussian noise to t-student noise (set 3), and if we changed stationary to nonstationary marginal distributions (set 4; in this case we use polynomial covariance, please refer to the Table 1). These four set and their properties are summarised in the Table 3.
We present the results for the Granger causality method, using the GCCA toolbox. The test statistic used in the toolbox is the measure of linear feedback introduced by Geweke (1982), as in the Eq. 6. The corresponding test used for testing the null hypothesis of lack of causality is the F-test. The results are presented graphically in the Figs. 15 and 16. The results of using Granger causality can be summarised by two main observations. Firstly, for strong linear causality relationship, the linear Granger causality test is very robust and practical even if we do not observe Gaussian noise or stationary covariance. Secondly, for nonlinear causality, the linear Granger causality method behaves no better than a random guess, regardless of the data size. How does that compare to our method? The Fig. 16 shows that for strong, linear causality, our method is not as robust as linear Granger causality, and requires a bigger sample. However, our method can successfully detect nonlinear causality. For the data with t-distributed noise, we present results for the test statistic calculated by assuming the correctly specified model, and using an approximate method 3 .

Case Study 2: Transfer Entropy
We have used the same data structures as described in the Eqs. 27 together with the Table 3. The results are graphically shown in the Fig. 17.
Transfer entropy is a popular method used as a nonlinear extension of the linear Granger causality (for Gaussian distributions these two methods are equivalent). It is able to consider wider range of data types and relationships, however it is much more difficult to estimate. Compared to our method, transfer entropy requires much larger data samples, and at the same time it is not able to deal with model structures like long memory, non-stationarity, etc. Comparing Figs. 15 and 17 shows inferior performance of transfer entropy to our method in each of the four cases, and inferior to (linear) Granger causality in three cases. Transfer entropy is better than Granger causality in recognising nonlinear causality,  however, only for the sample of size 500 is transfer entropy performing recognisably better than a random choice. What is not shown in the results, but for the sake of fairness needs to be mentioned, is the fact that transfer entropy is much faster than our method, with the current implementation.

Case Study 3: ARFIMA Model
The data that was used for this example has been generated according to an ARFIMA (1,d,1) model with external regressors, Eqs. 24-26, can be represented in a form emphasising the autoregressive part (this is possible because we restricted the choice of d to (0, 0.5)):  We estimate data using modified MATLAB code ARFIMA-SIM by Fatichi (2009). For fitting the ARFIMA with external regressors we use the rugarch R library. We present results for nine parameter settings, which are listed in the Table 4 .
We present the results of using our causality method to estimate causality in Fig. 18, while the results of using a fully specified likelihood of the ARFIMA model are shown in the Fig. 19.
Our method is operating on the GP model representation, which is clearly misspecified. However, that does not prevent our model from detecting causality even for the smallest samples of length 20. That is not the case for using the well specified ARFIMA model and estimated likelihood -in this case a very large sample is needed for the estimation to even converge -data of length 1000 is required for the calculation of the results for all 9 data sets.

Real Data Experiments
In this section we apply the testing procedures to analyse commodity futures data.
In our analysis we use the following data: 1 and 36 month expiry oil futures contracts, obtained from futures curves built on the basis of West Texas Intermediate (WTI) Crude oil futures prices traded on the New York Mercantile Exchange, as described by Ames et al. (2016). The effect of the currency level, captured by the US Dollar Index DXY, is constructed as an index of USD relative to EUR, JPY, GBP, CAD, SEK, CHF. Thirdly, we also use a widely considered proxy for convenience yield based on a component related to transportation expense, given by the cost of freighting and short term Fig. 18 Distributions of test statistic for GPC method, shown for three lengths of the time series, and for 9 data sets Fig. 19 Distributions of test statistic for the AFRIMA likelihood method, shown for two lengths of the time series, and for 9 data sets storage, measured by the Baltic Dry Index (BDI), see Ames et al. (2016). There is a stochastic functional relationship between commodity futures contracts of different maturities (term structure) based on: spot price, convenience yield, interest rate, and dollar value. Convenience yield is very hard to model, but can be captured to some extent by BDI, and the interest rate can be proxied by the time value of money expressed by the futures contracts. Hence the choice of both long and short dated futures contracts for our analysis. The Fig. 20 shows the four covariates from 17th Jan 1990 to 23rd Dec 2015. For literature studying classical relationships between these data, we refer to: Ames et al. (2016), Bakshi et al. (2010) andDempster et al. (2012).

Interpreting Causal Relationships
The study performed here uses causality testing to demonstrate the risk factors that investors should consider in their decision process. It also shows how speculators in currency markets and futures markets have a propensity to respond to information observed at different lags and the time it takes them to re-adjust the expectations for futures market hedging or speculation in light of this information. Figures 21, 22, 23, 24 present the changing significance of causal relationships between the dates 17th Jan 1990 to 23rd Dec 2015. The four pairs that we look at, and the abbreviations that we will use are as follows: 1 month oil futures (1m WTI) and freighting/ storage index (BDI), 36 months oil futures (36m WTI) and freighting/ storage index, 1 month oil futures and dollar index (DXY), 36 months oil futures and dollar index. We are presenting causal reactions at two lags: one week, which can be seen as nearly instantaneous, and eight weeks. Figures 21-24 show charts smoothed with cubic spline smoothing, which makes it easier to observe the main trends, in particular in the case of lags of 8 weeks.
Markets learn from the news and facilitate them into the price, according to the efficient market hypothesis 4 , to which we subscribe (Fama (1970); Fama and French (1988); Campbell  Campbell et al. (1997);Malkiel (2003)). We want to learn which variables have effect on price formation, and at what time horizon. We also want to relate to the fact that the three different classes of investments (oil futures, currencies, physicals) have different investor profiles, and thus we expect a difference in the type and speed of reaction. The last question that interests us, is whether the results confirm the intuition that regimes affect the direction and significance of causal influence. The interplay between WTI oil futures and the cost of freighting (BDI) Market participants investing in freighting are likely to be interested in the ownership of the physical asset, therefore BDI can be used as a proxy for convenience yield. It is expected that the WTI oil futures will not have instantaneous effect on the BDI, which is confirmed by our analysis showing that the causal direction from WTI to BDI is generally not statistically significant at 1 lag (Figs. 21 and 22,top subplots).
The effect to which the WTI futures incorporate the BDI movements varies across maturities. Short contracts have not been reacting to BDI changes in 1 week, with the exception of 2008/2009, which was a reaction to crisis. Similar response can be seen for longer maturities, however for longer maturities we observe the BDI→36m WTI to be significant through late nineties.
At 8 lags, we observe that the causal effects are significant in both directions, majority of the time. This can be seen as markets being able to absorb the information and adjust the expectation. For the times when this relationship breaks, investors use other sources, to inform their long term perception of risk and expectations: for example as a result of the 2008 crisis investors across many markets were decreasing their exposure to risk. In late nineties, as well as in 2014, we can observe a divergence of reactions of BDI to short and long term oil futures at 8 week lags: this could be seen as investors using outside information to decide on their long term expectations: for example about advancement in methodology or legislation pertaining renewable energy. The interplay between WTI oil futures and the dollar index (DXY) The dollar index is a weighted geometric mean of the dollar's value relative to a basket of foreign currencies: Euro (EUR) 57.6% weight, Japanese yen (JPY) 13.6% weight, Pound sterling (GBP) 11.9% weight, Canadian dollar (CAD) 9.1% weight, Swedish krona (SEK) 4.2% weight, Swiss franc (CHF) 3.6% weight. The Canadian dollar is considered a commodity currency, while the Japanese yen is particularly sensitive to changes in oil prices due to Japan importing almost all of its oil. Therefore market expectations towards dollar index will incorporate to a large degree the expectations that arise from the oil market.
Following the results from the Fig. 24, there is evidence to suggest that DXY drives longer dated futures more strongly. At the same time, when comparing top charts from Figs. 22 and 24, we notice similarity in causal pattern between DXY → 36m WTI and BDI → 36m WTI, in particular during the nineties. This could suggest another direct or indirect factor, common for the two causal direction, for example general attitude to risk.
We look at Markov Switching Model, to analyse if DXY and BDI will have similar patterns of states for volatility, when explained with VIX. We use the following models: where: S t and S ′ t , which we assume to only take values 1 and 2, are the states at time t for DXY and BDI respectively, 2 D,S t , 2 B,S ′ t are the variances of the innovation at state S t , S ′ t , 1,S t , 1,S ′ t are the mean coefficients at state S t , S ′ t , and D t , B t are innovations. Figure 25 presents the conditional standard deviation of error term for regime switching models from Eq. 28 and 29, scaled for clarity to [0, 1] , and superimposed on the power of the tests of BDI → 36m WTI and DXY → 36m WTI, for 1 lag. First of all, for BDI it is the decreased conditional volatility that coincides with higher evidence of causality, while for DXY it is the increased volatility. However the persistence of high evidence for causality from 1996 to 2002 for both DXY → 36m WTI and BDI → 36m WTI, coincides with the persistence of one state for conditional standard deviation of respective covariates over that period of time. This suggests that the perception of market risk as seen via VIX is a common driving factor for during the nineties, a factor which can supersede other dependencies.

Influence of the Absolute Value of the Oil Prices On the Causal Structure
During the times when world oil prices are seen as high, it is more reasonable to expect investments in oil infrastructure as well as storage and transport. Therefore, we would expect that the absolute level of the oil price affects the behaviour (direction, strength, persistence) of causality. To test this, we compare the causal structure, as well as the fitted models, during the period of low prices: 17. 01.1990 -11.08.1999 (bellow $40), and period of high prices: 26.05.2004 -11.03.2009 (above $90). We will be interested in the relative difference between the fitted mean values, as well as the relative difference between hyperparameters (coefficients of the mean): autoregressive and causal. For that we will be using two sample mean test. Please note, that while we are particularly interested in the change of regime in the fitted models, we also check the regime change of the causal test statistics -this is because we were earlier making a point of being able to detect causality even in misspecified models! Lets assume that for each of the pairs: 1 m WTI and BDI, 36 m WTI and BDI, 1 m WTI and DXY, 36 m WTI and DXY, we take X t to denote one of the time series from the pair, and Y t -the other: with the usual notation. We denote M X t and M Y t as time series of values of the mean functions fitted by the models used for causality testing on rolling windows. Figure 26 shows the two segments of the fitted means: segment corresponding to prices below $40 and above $90, and in the Fig. 27 these have been additionally filtered according to the significance of the causal hypothesis. The mean function estimations are calculated on moving windows, with one mean function estimation equal to a mean of fitted values for the respective window.
For each of the pairs, we performed a two means test: We have run the popular student-t distribution two means test, as well as a two means test using sieve bootstrap to correct for serial dependence. The results are unanimously rejecting the hypotheses of equal means.  Fig. 28, which we contrast with the results for our framework in Fig. 29. These two approaches paint considerably different pictures for the causal relationships between the two time series. Fundamentally, at lag 1, the linear regression framework shows high confidence for the causality (precisely, for rejections of the hypothesis of lack of causality), which contrasts with GP framework rejecting lack of causality for very few data windows. We conjecture, that linear regression model is overconfident due to not being able to recognise nonlinear effects, in particular to remove excess serial correlation that would subsequently invalidate the assumptions of the hypothesis test resulting in excess kurtosis in the test statistic distribution and overly confident decision outcomes as a result. This is confirmed when analysing residuals of the linear regression fits. We demonstrate that for three specific point in time to show three scenarios where either one, or both of the directions show a high confidence for the linear model, that we observe with our framework. Figure 30 presents a series Quantile-Quantile (QQ) plots of empirical residual quantiles versus normal quantiles of the residuals for the linear regression models for testing causality, and relate to the three dates marked on the evolution of causal influence in Fig. 29. Linear regression for the window ending on 24th January 1998 (first row in Fig. 30), strongly suggests a causal direction from 1 month futures contract to the Baltic Dry Index for 1 lag, a relationship which our framework strongly rejects. But when we look at the residuals of the linear model, we see evidence of serial correlation and skewness, and this is arguably stronger than for the opposite direction for which linear regression model does not support the existence of causality. For window ending on 1st May 2002, linear regression results with residuals that exhibit very strong leptokurtic tails in both directions -and again our framework does not support the hypothesis of lack of causality here. Finally for the window ending on 21st January 2009 linear regression again does not sufficiently account for serial correlation, but in this case our framework rejects the hypothesis of lack of causality for the direction of BDI to 1 month WTI. Fig. 29 Evolution of the causal influence tested with the framework based on GPs: 1-pvalues of the test statistic for 1 months WTI and BDI, with 1 lags (top subplot) and 8 lags (bottom subplot), rolling window of 104 weeks and cubic spline smoothing Our conjecture of serial correlation in residuals leading to overconfidence of the linear model is supported by results that correct for such serial correlation. Figure 31 presents the result of testing for causality with a GP framework that a) incorporates linear trend from linear regression, and b) does not incorporate causal structure in the covariance, while the GP framework from Fig. 32 incorporates a) linear trend from the linear regression, b) allows for causality in covariance. Correcting for serial correlation removes some of the overconfidence of the linear regression model, which is then further reduced by also correcting for potential dependence in the covariance.
We conclude that while using linear regression models for testing causality can have higher power, this could be misleading, as the model could be overconfident due to incorrect statistical assumptions. Using GPs can not only help with these specific structural properties that we mentioned: serial correlation and causality in covariance, but it goes even further, by allowing to test for causality under a range of model assumptions without penalising model misspecification. Fig. 30 QQ plots of the residuals for the linear regression models for testing causality, for data windows ending on: data windows ending on 24th January 1998, 1st May 2002, and 21st January 2009 (rows). Each of the four columns of qq plots represent a combination of lag and direction of the causality Fig. 31 Evolution of the causal influence tested with the framework based on GPs with trend from linear regression and no causality in covariance: 1-pvalues of the test statistic for 1 months WTI and BDI, with 1 lags (top subplot) and 8 lags (bottom subplot), rolling window of 104 weeks and cubic spline smoothing

Conclusion
We demonstrated that our proposed testing frameworks for statistical causality in general classes of multivariate nonlinear time series models are statistically efficient in detecting a wide range of different causality structures in complex multivariate nonlinear time series structures. It accommodates flexible features where causality may be present in either: trend, volatility or both structural components of the multivariate time series considered.
The analysis of the power of the hypothesis tests shows that the framework not only behaves as expected but also has properties that make it practical. An important result in this paper is obtaining a test statistic with known asymptotic distribution, but what is even more important is that we do not need a very large sample to be able to use that result in practice. For simple tests -ones that use exact hyperparameters, and compound tests -where the hyperparameters are estimated, we look at popular tools for assessing the quality of a testing procedure: test statistic distribution, power of the test and the ROC Fig. 32 Evolution of the causal influence tested with the framework based on GPs with trend from linear regression and allowing for causality in covariance: 1-pvalues of the test statistic for 1 months WTI and BDI, with 1 lags (top subplot) and 8 lags (bottom subplot), rolling window of 104 weeks and cubic spline smoothing curves. Furthermore, we compare our approach to Granger causality and transfer entropy -typical benchmarks for testing causality, and we conclude that our approach is practical in all cases, but offers superior performance especially for time series with long memory. Finally, we offer an example of real data application to analysing risk factors that investors should consider when building a portfolio of oil futures, currencies and physicals.

Calculating Marginal Likelihood
In the Eq. 14 we defined the causality metrics L X→Y|Z as a logarithm of a likelihood ratio we have: The two log-likelihood terms can be obtained in an analogous way, so lets show it for the model B. The Eq. 10 was defining t as ) . If we denote as vector of all Y t 's, use B and B and B then we get the following distributions: Which are combined to calculate the marginal likelihood: and that gives (for example by brute force and completing the squares) the following marginal likelihood: The marginal log-likelihood (or log marginal likelihood) is therefore equal: What we need for the causality metrics in the Eq. 14 is the maximum marginal loglikelihood, which we can neatly achieve from Rasmussen and Williams (2006)

Receiver Operating Characteristic (ROC)
Receiver operating characteristic (ROC) curves are commonly used in classification models to quantify the accuracy with which a model can discriminate between two classes. If we called one class as containing positive cases and the other -negative cases, then if the model correctly classifies, it will produce "true positive" and "true negative" labels, and if it incorrectly classifies, it will produce "false positive" and "false negative" cases. The ROC curve plots True Positive Rate (TPR or Sensitivity) versus False Positive Rate (FPR or 1-Specificity), for a range of thresholds T: In the case of continuous variables, as we have been dealing with, the classification rule is based on the test statistics being above / below a threshold, or it is cumulative distribution being above / below appropriate threshold. Recall the Sect. 4.1.2 and the asymptotic result for the distribution of the test statistics 2L X→Y|Z ∼ 2 2k . We were mentioning that the intuition is that large values of the causality "metrics" L X→Y|Z coincide with causality, while lower -with lack of causality. So our classification rule could be to reject the hypothesis H 0 of no causality if L X→Y|Z > 0.5T ⇔ 2L X→Y|Z > T and accordingly F 2 q (2L X→Y|Z ) > 1 − for some related significance value of .
Then let f 0 be a cdf of 2 q , while f 1 be a cdf of non-central 2 q . Then: The coordinates of the ROC curve are (FPR(T), TPR(T)), which leads to parametrisation: See Zou et al. (2007); Hillis and Metz (2012).

Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.