Causal coupling inference from multivariate time series based on ordinal partition transition networks

Identifying causal relationships is a challenging yet crucial problem in many fields of science like epidemiology, climatology, ecology, genomics, economics and neuroscience, to mention only a few. Recent studies have demonstrated that ordinal partition transition networks (OPTNs) allow inferring the coupling direction between two dynamical systems. In this work, we generalize this concept to the study of the interactions among multiple dynamical systems and we propose a new method to detect causality in multivariate observational data. By applying this method to numerical simulations of coupled linear stochastic processes as well as two examples of interacting nonlinear dynamical systems (coupled Lorenz systems and a network of neural mass models), we demonstrate that our approach can reliably identify the direction of interactions and the associated coupling delays. Finally, we study real-world observational microelectrode array electrophysiology data from rodent brain slices to identify the causal coupling structures underlying epileptiform activity. Our results, both from simulations and real-world data, suggest that OPTNs can provide a complementary and robust approach to infer causal effect networks from multivariate observational data.


Introduction
The detection of causal interactions is a fundamental problem in both natural and social sciences [1,2]. Reliable statistical inference of causality can aid in making predictions, which may eventually allow the design of proper intervention strategies [3]. For example, in neurological disorders such as epilepsy, predicting a seizure from electroencephalography (EEG) recordings and preventing its occurrence is a long-standing problem that has not yet been completely solved. In this context, reliable seizure prediction algorithms and consequent appropriate interventions to prevent seizure occurrence could be life-saving for the patient.
In the past few decades, several approaches have been developed to both identify and quantify the interdependence between observational time series. The Granger causality test can be used to infer causality between two time series [4], and its extension, partial directed coherence, allows inferring causality from multivariate data [5]. The Granger causality test is based on linear regression. Under this framework, a causal relation of one system (or variable) X affecting another system Y (i.e., X → Y) is statistically inferred if the variance of the prediction error of the behavior of Y can be reduced by including past information about observations of X in the regression model for Y. Since the classical Granger causality analysis is model-based and (in a strict sense) only valid for linear systems, more general bivariate approaches based on information theory have been proposed for identifying causality in applications to nonlinear dynamical systems [6]. These methods include, among others, transfer entropy [7], time-delayed mutual information [8], and the multivariate extension of transfer entropy [6]. In particular, transfer entropy can be considered as a generalization of Granger causality for nonlinear systems [9], while it has been shown to be equivalent to Granger causality for linear Gaussian models [10].
With the rising availability of powerful computer infrastructures and large observational data sets, the study of statistical interdependencies based on multivariate time series has found widespread applications, for instance, in the area of neuroscience. In the context of multi-channel EEG recordings, such interdependencies can be related to the concept of functional connectivity, which commonly refers to the statistical associations between pairs of signals that are measured in terms of simple linear correlations or variants thereof and, thus, cannot distinguish between direct and indirect connectivity [11]. Several statistical association measures have been employed to estimate functional connectivity from EEG data, including Pearson correlation, mutual information, synchronization likelihood, and phase locking value, all of which are symmetric measures and do not provide information regarding the direction of the associated information flow [8,12-15].
While studying complex systems such as neuronal networks, it is often important to identify not only the symmetric statistical associations, but also the causal relationships (i.e., driver-response relationships) between the involved sub-systems [2]. For example, effective connectivity [16] between individual neurons (or ensembles of neurons) is characterized by direct (causal) relationships. Hence, statistical inference of causality based on time series is a more informative approach to better understand the interplay between neuronal connectivity and dynamics, and is key to studying the structure-function relationship of neuronal networks [17]. To estimate effective connectivity, Granger causality-based methods such as directed coherence and transfer entropy have recently been proposed [18]. Although these concepts provide asymmetric measures, they cannot distinguish between causal (direct) and non-causal (indirect) interdependence between two subsystems. By contrast, partial directed coherence [5] (which has often been used in EEG studies) can distinguish between direct and indirect causal links but assumes a simple linear multivariate autoregressive model for describing the data. More recently, Bayesian filtering approaches have been suggested to estimate connectivity. However, such approaches also make strong assumptions about the underlying dynamical model [19][20][21]. Another approach known as dynamic causal modeling [22] has also been widely used to estimate causality among brain regions using EEG, magnetoencephalography or functional magnetic resonance imaging measurements. However, dynamic causal modeling is also a model-based approach and makes strong assumptions on the process by which the data are generated. In addition, it requires the pre-specification of several competing hypotheses, which may not always be available.
For a detailed review on various methods for estimating neural connectivity and their applications in neuroscience, we further refer to [23,24] and the references therein.
In recent years, a great deal of interest has emerged in characterizing dynamical systems using complex network-based time series analysis methods [25]. Those methods include, among others, recurrence networks [26-28], visibility graphs [29], and transition networks [30], all of which feature different definitions of nodes and links in the resulting network representations of the time series under study. For instance, in the case of recurrence networks, the edges are defined based on the proximity of observed states in phase space, whereas for visibility graphs [29], mutually visible elements in a univariate time series are linked to form a network. Finally, in the case of transition networks [30], certain discrete states or patterns are defined as nodes, and if one of these dynamical structures is followed by the other with nonzero probability along the observed (or reconstructed) trajectory, a directed edge is established between the corresponding nodes [31]. However, the majority of previous applications of the aforementioned methods have focused on univariate time series, while a generalization to multivariate time series would be a necessary step to allow detecting signatures of causality. This aspect has not yet been systematically explored in full detail in the recent literature.
In this work, we consider a particular class of transition networks known as ordinal partition transition networks (OPTNs) [32] to infer causality from multivariate time series. OPTNs are based on the ordinal patterns embedded in a time series, the systematic analysis of which was originally proposed in [33]. Such ordinal patterns reflect the respective rank order among a predefined sequence of univariate observation values. Identifying the series of subsequent ordinal patterns in a univariate time series results in a particular symbolic representation of the observed system's trajectory. It has been shown that under specific conditions, this ordinal partition exhibits the generating property, which makes it attractive for applications since it implies topological conjugacy between phase space and the ordinal symbolic dynamics [30]. For an unconstrained stochastic process, all possible ordinal patterns occur with equal probability. By contrast, for a time series produced by deterministic dynamics, certain ordinal patterns commonly do not appear; these are known as forbidden patterns [34] and provide a window of opportunity to test for possible determinism in a time series [35][36][37]. Furthermore, given the symbolic representation of the phase space trajectory (which controls the respective frequency of the different ordinal patterns), it is possible to compute a variety of dynamical characteristics, such as permutation entropy or a plethora of statistical complexity measures that can be defined based on the latter concept. Taken together, ordinal pattern-based analysis offers several advantages as compared to other, more traditional nonlinear time series analysis techniques. The resulting methods are conceptually simple, computationally fast, and can capture information about the short-range temporal structure of the underlying time series [38]. In addition, they have been shown to be robust against additive noise.
Finally, the calculation of ordinal patterns does not require any a priori knowledge of the data range, which is practical and advantageous in time series analysis [38].
An OPTN is based on the ordinal symbolic encoding of a time series and consists of [31]: (1) nodes, which represent the individual ordinal patterns, and (2) probability-weighted edges, which represent the transition frequencies between two successive ordinal patterns. Previous applications of statistical complexity measures derived from OPTNs include the classification of cardiac dynamics based on electrocardiography data [30,39] and the analysis of EEG data from healthy and epileptic humans [40]. Although recent attempts have addressed bi- and multivariate extensions of OPTNs [34], often with the aim of characterizing different types of synchronization transitions, they have not yet provided thorough information about causal relationships among multivariate time series. Most recently, Ruan et al. [31] have proposed a strategy for the estimation of several complexity measures based upon bipartite OPTNs, which allows for statistical inference of the coupling direction among paired time series. However, their approach is limited by its bivariate nature. Hence, when applied to a multivariate data set, it cannot distinguish between direct and indirect causal connections among the individual time series.
To overcome the aforementioned limitations of previous OPTN-based methods, in this work we propose an extension of OPTN-based time series analysis, which leverages the construction of multiple bipartite OPTNs (M-OPTN) to account for multivariate (i.e., comprising more than two components) time series. Specifically, we outline and thoroughly test an approach for distinguishing direct from indirect causal connections based on the conditional Shannon entropies of the bipartite constituents of the M-OPTN [31]. In order to demonstrate the effectiveness of our approach, we apply the proposed method to coupled linear stochastic processes, nonlinear dynamical systems (exemplified by three interacting Lorenz systems), and a network of coupled neural mass models. The latter type of system has been shown previously to mimic the dynamics exhibited by neurophysiological time series [41] and thus serves as a validation tool to test the applicability of our method to neuronal time series. Finally, as a real-world example, we study microelectrode array (MEA) recordings from an in vitro model of acute ictogenesis, i.e., rodent brain slices in which epileptiform discharges are induced by pharmacological treatment. The network interactions and the associated delays of epileptiform discharge propagation that occur in this in vitro model have been extensively characterized [45,46] and serve as a reliable reference to validate the causal network relationships and delays estimated by the proposed method.

Ordinal partition transition networks
Given a univariate time series X = {x_t}_{t=1}^{T}, following Takens' embedding theorem, we can qualitatively reconstruct the underlying phase space trajectory by using M successively lagged replications of X, each separated by a lag d, yielding the vectors

z_t = (x_t, x_{t+d}, ..., x_{t+(M−1)d})   (1)

for t = 1 to T − (M − 1)d, where M and d are the embedding dimension and delay, respectively. Each embedding vector z_t is mapped to a sequence of integers (s_0, s_1, ..., s_{M−1}) that describes the rank order of its components (with s_0 indexing the smallest value) and is a unique permutation of the set {0, 1, ..., M − 1}, thereby satisfying

x_{t+s_0 d} ≤ x_{t+s_1 d} ≤ ... ≤ x_{t+s_{M−1} d}   (2)

and, in the case of ties,

s_{u−1} < s_u if x_{t+s_{u−1} d} = x_{t+s_u d}.   (3)

Note that there exist M! different possible ordinal patterns when a time series is embedded in M dimensions, and we denote these patterns by π_1, π_2, ..., π_{M!}. As an example, consider a 5-dimensional embedding of a time series yielding an embedding vector

(x_t, x_{t+d}, x_{t+2d}, x_{t+3d}, x_{t+4d}) = (3, 9, 10, 1, 6).   (4)

Here, x_{t+3d} < x_t < x_{t+4d} < x_{t+d} < x_{t+2d}, and thus this vector would be mapped to the ordinal pattern or symbol π_k = (3, 0, 4, 1, 2). The exact numerical value of the resulting integer index k ∈ {1, ..., M!} depends on the specific sorting of the permutations, the default of which may differ among algorithms and programming languages.
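This rank-ordering step can be sketched with NumPy's argsort (a minimal illustration, not the implementation used in the cited works; stable sorting is assumed here as the tie-breaking rule):

```python
import numpy as np

def ordinal_pattern(values):
    """Map an embedding vector to its ordinal pattern (s_0, ..., s_{M-1}),
    where s_i is the index of the i-th smallest component."""
    return tuple(np.argsort(values, kind="stable"))

# The 5-dimensional example from the text:
ordinal_pattern([3, 9, 10, 1, 6])  # -> (3, 0, 4, 1, 2)
```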
For a univariate time series, we can then construct an (unweighted or weighted) OPTN with M! nodes by first repeating this encoding procedure for each embedding vector. A weighted OPTN is obtained by setting the weight of the edge between two nodes (permutations) to be equal to the empirical frequency of "transitions" (i.e., successive occurrences) between the corresponding possible ordinal patterns. An unweighted OPTN simply contains a directed edge of unit weight between the corresponding nodes if this frequency is nonzero.
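The construction of a weighted OPTN can be sketched as follows; this is an illustrative implementation (symbol extraction plus transition counting), not the one used by the authors:

```python
from collections import Counter

import numpy as np

def symbol_sequence(x, M=3, d=1):
    """Ordinal symbols of all embedding vectors of a univariate series x."""
    n = len(x) - (M - 1) * d
    return [tuple(np.argsort(x[t:t + M * d:d], kind="stable")) for t in range(n)]

def weighted_optn(symbols):
    """Directed edge weights = empirical frequency of transitions
    between successive ordinal patterns."""
    transitions = Counter(zip(symbols[:-1], symbols[1:]))
    total = sum(transitions.values())
    return {edge: count / total for edge, count in transitions.items()}

x = np.sin(np.linspace(0.0, 20.0, 500))
optn = weighted_optn(symbol_sequence(x, M=3, d=1))
```

For an unweighted OPTN, one would simply keep the set of edges with nonzero count instead of the frequencies.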
Ruan et al. recently extended the idea of OPTNs to bivariate time series that may interact either linearly or nonlinearly [31]. In their framework, given the time series {x_{1,t}}_{t=1}^{T} and {x_{2,t}}_{t=1}^{T}, derived from two dynamical systems X_1 and X_2, we can derive the associated sequences of ordinal patterns underlying each time series as described above, containing the ordinal patterns π_i^{x_1} and π_j^{x_2} for X_1 and X_2, respectively. One can now compute the (instantaneous or time-lagged) conditional co-occurrence frequencies p(π_j^{x_2} | π_i^{x_1,τ}) by simply counting the number of cases in which π_j^{x_2} occurs with a time lag of τ following an occurrence of π_i^{x_1}. Note that we deviate here from the notation in [31] in order to allow a straightforward generalization of the concept of co-occurrence probabilities by including multiple variables X_i that may possibly cause variations of the given X_2 at different delays τ_i (see Section 2.2 below). Thus, τ = 0 corresponds to looking at simultaneous co-occurrence, while τ > 0 implies looking at lagged co-occurrence. Given these conditional co-occurrence frequencies, Ruan et al. proposed the estimation of the conditional entropy (CE) [31], given as

H_τ(X_2|X_1) = − Σ_{i,j} p(π_i^{x_1,τ}, π_j^{x_2}) log_2 p(π_j^{x_2} | π_i^{x_1,τ}),   (5)

which quantifies the interaction X_1 → X_2 at a delay of τ. The interaction in the other direction at lag τ, X_2 → X_1, can be defined in an analogous way as

H_τ(X_1|X_2) = − Σ_{i,j} p(π_j^{x_2,τ}, π_i^{x_1}) log_2 p(π_i^{x_1} | π_j^{x_2,τ}).   (6)

If X_1 and X_2 are independent and their different ordinal patterns uniformly distributed, then p(π_j^{x_2} | π_i^{x_1,τ}) = p(π_i^{x_1} | π_j^{x_2,τ}) = 1/M!, and thus H_τ(X_1|X_2) = log_2 M!, which is the upper bound for the CE value, denoted as H_max. On the other hand, if X_1 and X_2 are fully dependent, then p(π_i^{x_1} | π_j^{x_2,τ}) = 1 and ideally H_τ(X_1|X_2) = 0. Thus, as the strength of causal interaction from X_2 to X_1 at a given time lag τ increases, H_τ(X_1|X_2) decreases.
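A plug-in estimator of such a lagged CE might look as follows (a sketch only, assuming the two symbol sequences were generated as above and share a common time axis):

```python
from collections import Counter

import numpy as np

def lagged_conditional_entropy(sym_x1, sym_x2, tau):
    """Estimate the CE of the pattern of X2 given the pattern of X1
    observed tau steps earlier, from two aligned symbol sequences."""
    pairs = list(zip(sym_x1[:len(sym_x1) - tau], sym_x2[tau:]))
    joint = Counter(pairs)                       # p(pattern1, pattern2)
    marginal = Counter(p1 for p1, _ in pairs)    # p(pattern1)
    n = len(pairs)
    return -sum((c / n) * np.log2(c / marginal[p1])
                for (p1, _), c in joint.items())
```

For fully dependent sequences the estimate is 0, while for an uninformative (constant) conditioning sequence it reduces to the plain entropy of the target patterns.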

Causal inference strategy based on entropy measures from OPTNs
When dealing with multivariate data, i.e., data from three or more interacting systems, it is necessary to distinguish direct links from indirect ones. For example, consider a transitive chain X_1 → X_2 → X_3 (Fig. 1(a)). Applying the bivariate OPTN-derived entropy measure as described in [31] would lead to the detection of a non-existing connection from X_1 → X_3, as shown in Fig. 2(a). Similarly, consider a fork pattern, where X_1 → X_2 and X_1 → X_3 (Fig. 1(b)). Here, X_1 is a common driver to both X_2 and X_3, and this will lead to a spurious connection between X_2 and X_3 as shown in Fig. 2(b). Note that in Fig. 2, an interaction m → n at negative delay is to be interpreted as n → m.
In order to distinguish such direct from indirect links, a sophisticated way of conditioning has to be employed, thereby generalizing the previous strictly bivariate approach. In the following, we will detail a possible methodology to use OPTN-based entropy measures to infer causality from multivariate time series as outlined in Fig. 3. Note that this methodology rests upon certain general assumptions common to causal inference methods, most notably, the completeness of the set of variables analyzed (i.e. the absence of any possible hidden drivers).

STEP I : Pre-processing of observational data
Depending on the particular research problem, observational time series first have to be pre-processed, which includes standard procedures like band-pass filtering, resampling and removal of any noise or artifacts if present. If necessary, the data can be further divided into a number of overlapping windows to obtain a time-dependent estimate of the coupling measure.

STEP II : Construction of M-OPTNs
Given a time series, its phase space is reconstructed following Takens' embedding theorem. The components of the resulting embedding vectors are then rank ordered to obtain the symbolic representation of the time series. For an N-channel multivariate time series, N such OPTNs are constructed, which are collectively referred to as an M-OPTN.
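STEP II can be sketched for an N-channel series stored as a (T, N) array; the array layout Pi[t, n] follows Algorithm 1 below, while everything else (stable sorting, loop structure) is illustrative:

```python
import numpy as np

def m_optn_symbols(X, M=3, d=1):
    """Map each channel of a (T, N) multivariate series to its sequence
    of ordinal patterns, i.e., the symbol array Pi with Pi[t, n] holding
    the pattern of channel n at time index t."""
    T, N = X.shape
    L = T - (M - 1) * d
    Pi = np.empty((L, N, M), dtype=int)
    for n in range(N):
        for t in range(L):
            Pi[t, n] = np.argsort(X[t:t + M * d:d, n], kind="stable")
    return Pi
```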

STEP III : CE from M-OPTN
After constructing the M-OPTN, we compute the bivariate information-theoretic measure of CE as given in Eq. (5) for each pair of variables and define the matrix

H_τ = [H_τ(X_n | X_m)]_{n,m=1,...,N},   (7)

where each off-diagonal term H_τ(X_n|X_m) represents a possible causal link from the m-th time series to the n-th time series, i.e., X_m → X_n, at delay τ.

Algorithm 1 Construct M-OPTN from multivariate time series
1: procedure ComputeM-OPTN
2:   Input: multivariate time series X_1, X_2, ..., X_N, embedding dimension M, and lag d
3:   Output: M-OPTN Π
4:   for n = 1 to N do
5:     for t = 1 to T − (M − 1)d do
6:       Map z_{n,t} to symbol (s_0, ..., s_{M−1}) using Equations (2) to (3)
7:       Assign ordinal pattern Π[t, n] = π_k^{X_n} based on the respective permutation
8:     end for
9:   end for
10: end procedure
Note that in what follows, we will not make use of the diagonal elements of H_τ (representing the classical Shannon entropies, i.e., in our specific case the permutation entropies, of the individual processes), so that they can be safely ignored or simply set to zero. Let H = {H_{τ_1}, H_{τ_2}, ..., H_{τ_J}} denote the CE matrices obtained from the M-OPTN for a range of J delays T = {τ_1, ..., τ_J}. Each of the matrices H_τ obtained above is next thresholded using a hard threshold to obtain a new matrix Ĥ_τ with elements

ĥ_{nm,τ} = H_τ(X_n|X_m) if H_τ(X_n|X_m) < λ H_max, and ĥ_{nm,τ} = 0 otherwise,   (8)

where H_max = log_2 M! and λ is usually set between 0.99 and 1. Due to the finite sample size, two independent processes will not have a CE value exactly equal to H_max. To account for this numerical issue, we allow for some tolerance by setting the parameter λ. After this step, only dominant neighbors (both causal and non-causal) are retained for each node. The resulting matrix Ĥ_τ represents a weighted, directed network of N nodes.
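The hard-thresholding step can be sketched as follows, under the assumption (consistent with the description above) that retained entries keep their CE value as edge weight and the diagonal is set to zero:

```python
import math

import numpy as np

def threshold_ce_matrix(H, M, lam=0.995):
    """Zero out entries whose CE is too close to H_max = log2(M!) to
    indicate coupling; keep the remaining (dominant) links as weights."""
    H_max = math.log2(math.factorial(M))
    H_thr = np.where(H < lam * H_max, H, 0.0)
    np.fill_diagonal(H_thr, 0.0)  # diagonal (permutation entropies) is ignored
    return H_thr
```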

Algorithm 2 Compute and threshold CE matrices from the M-OPTN
1: procedure ComputeCE
2:   Input: M-OPTN Π, set of delays T = {τ_1, ..., τ_J}, threshold λ
3:   Output: thresholded CE matrices Ĥ
4:   for i = 1 to J do
5:     for n = 1 to N do
6:       for m = 1 to N do
7:         Compute H_{τ_i}(X_n|X_m) using Equation (5)
8:       end for
9:     end for
10:    Threshold H_{τ_i} using λ and H_max as shown in Equation (8)
11:  end for
12: end procedure

STEP IV : Identifying parents and children

The set of thresholded matrices Ĥ = {Ĥ_{τ_1}, ..., Ĥ_{τ_J}} defines a multi-layer network with a set of nodes V (representing the different component processes) common to the different layers (which represent the different delays τ_j), while E = {E_1, ..., E_J} is a set of edges that will commonly differ among the layers. For every node X_m in the multi-layer network defined by Ĥ, we identify a set of k_m parents P_{X_m} = {p_1, ..., p_{k_m}} at delays {τ_{p_1}, ..., τ_{p_{k_m}}} and l_m children C_{X_m} = {c_1, ..., c_{l_m}} at delays {τ_{c_1}, ..., τ_{c_{l_m}}}, respectively, which can span across all possible layers. The set of parents for a node X_m is given by

P_{X_m} = {X_j : ĥ_{mj,τ} ≠ 0, τ ∈ T},   (9)

where ĥ_{mj,τ} describes the element in row m and column j of the single-layer adjacency matrix Ĥ_τ, cf. Equation (8). In a completely analogous way, we define the set of children of node X_n as

C_{X_n} = {X_i : ĥ_{in,τ} ≠ 0, τ ∈ T},   (10)

where ĥ_{in,τ} represents the element in row i and column n of Ĥ_τ. By defining those two sets for all nodes (variables) X_m (X_n), we collapse the information contained in the multi-layer adjacency matrix Ĥ to the essential strong bivariate (time-lagged) linkages among the set of considered variables.

Algorithm 3 Identify parents and children
1: procedure ComputeParentsChildren
2:   Input: thresholded CE matrices Ĥ
3:   Output: P_{X_n} and C_{X_n} for all nodes
4:   for n = 1 to N do
5:     Compute P_{X_n} and C_{X_n} using Equations (9) and (10)
6:   end for
7: end procedure

STEP V : Identifying the minimal set of neighbors for conditioning

Given a set of parents for node X_m, P_{X_m} = {p_1, ..., p_k} at delays {τ_{p_1}, ..., τ_{p_k}}, to test for a causal connection from node X_n to node X_m, we seek to define a minimal conditioning set P^min_{X_m} ⊂ P_{X_m}, given as

P^min_{X_m} = P_{X_m} ∩ C̄_{X_n},   (11)

where C̄_{X_n} is the set of children of node X_n excluding X_m, i.e., C̄_{X_n} = C_{X_n} \ {X_m}. The set C̄_{X_n} will be empty if C_{X_n} = {X_m}, i.e., the only child of node X_n is node X_m, and in this case we set P^min_{X_m} = {X_m}.
Also, it is possible that P^min_{X_m} will be an empty set due to no common elements between P_{X_m} and C̄_{X_n}. In this case, we set P^min_{X_m} = P_{X_m} ∩ P_{X_n}. If P^min_{X_m} is still an empty set, then we set P^min_{X_m} = {X_m}. Note that in each of these cases, along with the conditioning set P^min_{X_m}, we also obtain the corresponding delays {τ_1, ..., τ_r}, with |P^min_{X_m}| = r. In order to facilitate reliable computation of the CE (see STEP VI) to exclude non-causal neighbors, in all our applications we restrict ourselves to r = 3 if r > 4, by choosing the three most dominant neighbors based on their respective CE values.
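The fallback cascade of STEP V can be sketched with plain set operations (a sketch only; delays are omitted for brevity and the variable names mirror the text):

```python
def minimal_conditioning_set(P_m, P_n, C_n, X_m):
    """Return P_min for testing the link X_n -> X_m, following the
    cascade described above (Eq. (11) plus the two fallback rules)."""
    C_bar = C_n - {X_m}          # children of X_n, excluding X_m
    if not C_bar:                # only child of X_n is X_m
        return {X_m}
    P_min = P_m & C_bar          # Eq. (11)
    if not P_min:
        P_min = P_m & P_n        # fall back to common parents
    if not P_min:
        P_min = {X_m}
    return P_min
```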

Algorithm 4 Identify the minimal conditioning set
1: procedure ComputeMinimalSet
2:   Input: parent and children sets P_{X_m}, C_{X_n} for all nodes
3:   Output: P^min_{X_m}
4:   for m = 1 to N do
5:     for n = 1 to N do
6:       Compute P^min_{X_m} using Equation (11) and the fallback rules above
7:     end for
8:   end for
9: end procedure

STEP VI : Removal of non-causal neighbors by proper conditioning
To check whether X_n is a truly causal parent of X_m, we compute

Δ_{X_n} = H(X_m | P^min_{X_m}) − H(X_m | P^min_{X_m}, X_n).   (12)

If X_n is only an indirect causal connection to X_m, then Δ_{X_n} = 0, since conditioning on X_n should not reduce H(X_m | P^min_{X_m}) any further. However, since we are dealing with finite data, in practice Δ_{X_n} ≈ 0. We therefore set another pre-defined threshold δ, and if Δ_{X_n} < δ, then X_n is considered an indirect causal link to X_m. The CE H(X_m | P^min_{X_m}), with |P^min_{X_m}| = r, is given as

H(X_m | P^min_{X_m}) = − Σ p(π^{x_m}, π^{p_1,τ_1}, ..., π^{p_r,τ_r}) log_2 p(π^{x_m} | π^{p_1,τ_1}, ..., π^{p_r,τ_r}),   (13)

where the sum runs over all observed pattern combinations. The CE H(X_m | P^min_{X_m}, X_n) is defined in an analogous way.

Numerical examples
In the following, we present results from the application of the proposed methodology to simulations of linearly interacting stochastic processes as well as interacting nonlinear dynamical systems, including Lorenz systems and a network of neural mass models. In the case of the interacting stochastic processes and the interacting Lorenz systems, we varied δ from 0 to 0.5. We also added observational noise e(t) ∼ β N(0, 1) to the simulated data, where β is the noise level (NL), set at 0.1, 0.2 and 0.4 times the standard deviation of the original noise-free time series.
In the case of the network of neural mass models, we varied δ from 0 to 0.25 and added noise at levels of 0.5 and 1 times the standard deviation of the original noise-free time series, which corresponds to realistic signal-to-noise ratios commonly found in EEG data (amplitude signal-to-noise ratios of 2 and 1, respectively).
For evaluating the results of our methodology for the different simulations, we define the number of true positives (#TP) as the number of correctly identified links, and the number of false negatives (#FN) as the number of missed links. The number of false positives (#FP) is defined as the number of incorrectly identified links, and the number of true negatives (#TN) corresponds to the number of correctly identified non-links. We then define the true positive rate (TPR) and false positive rate (FPR) as

TPR = #TP / (#TP + #FN)   (14)

and

FPR = #FP / (#FP + #TN).   (15)

In addition, we use the F_1-score to quantify the accuracy of the method, which is given as

F_1 = 2 #TP / (2 #TP + #FP + #FN).   (16)
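These scores follow directly from the four counts; a trivial sketch to make the definitions concrete:

```python
def detection_scores(tp, fn, fp, tn):
    """True positive rate, false positive rate and F1-score of an
    estimated causal network compared against the ground truth."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return tpr, fpr, f1
```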

Interacting stochastic processes
We first simulated a multivariate autoregressive system (Eq. (17)) constructed such that it contains a directed chain as well as a fork, both of which can lead to spurious causal links. The causal structure of this system is shown in Fig. 4. We vary the threshold δ from 0 to 0.5, and for each value of δ, we generate 50 realizations of the system described in Eq. (17). We varied the range of delays from 1 to 10, and the embedding dimension and delay were set to 3 and 100, respectively. We performed the simulations for data sets of size T = 1000, 5000, 10000 and 20000 samples. Figure 5 shows the results from an exemplary simulation with δ = 0.15, T = 10000 samples and NL = 0. As we can see from the figure, the causal interactions among the stochastic processes are correctly identified along with their respective delays. In Fig. 5, an interaction m → n at a negative delay is again to be interpreted as n → m. A causal interaction between two processes at a particular delay results in a drop of the CE value away from H_max, which in our case is given by log_2 M! ≈ 2.58. Figures 6 and 7 show the TPR and FPR values, respectively, for the interacting stochastic system as δ, T and NL are varied. We can see that at T = 1000, high TPR as well as FPR values are obtained, with no significant changes in their values as NL or δ is varied. In the case of the TPR values (see Fig. 6), we find that for T ≥ 5000, the δ value at which TPR starts to drop below 1 slightly decreases as NL increases, but in all cases, TPR = 1 for δ < 0.2. For δ > 0.3, the TPR decreases further (see also the corresponding FPR values in Fig. 7). Figure 8 shows the accuracy of the causal inference algorithm in terms of the F_1-score for the interacting stochastic system. We observe that for T = 1000, F_1 = 0, irrespective of the choice of δ or the level of the noise. For T = 5000, we find that F_1 tends to increase as δ is increased and reaches ≈ 0.9 only when δ > 0.2.
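A toy system with the same two motifs can be sketched as below; the coefficients and lags are hypothetical and are not those of Eq. (17), they merely reproduce a chain and a fork that can induce spurious causal links:

```python
import numpy as np

def simulate_chain_and_fork(T=10000, seed=0):
    """Hypothetical VAR with a chain X1 -> X2 -> X3 and a fork
    X1 -> {X2, X4}; all coefficients are illustrative only."""
    rng = np.random.default_rng(seed)
    e = rng.normal(size=(T, 4))
    x = np.zeros((T, 4))
    for t in range(2, T):
        x[t, 0] = 0.5 * x[t - 1, 0] + e[t, 0]
        x[t, 1] = 0.4 * x[t - 1, 1] + 0.5 * x[t - 2, 0] + e[t, 1]  # X1 -> X2 (lag 2)
        x[t, 2] = 0.4 * x[t - 1, 2] + 0.5 * x[t - 1, 1] + e[t, 2]  # X2 -> X3 (lag 1)
        x[t, 3] = 0.4 * x[t - 1, 3] + 0.5 * x[t - 1, 0] + e[t, 3]  # X1 -> X4 (lag 1)
    return x
```

A bivariate analysis of such data would typically also report the indirect link X1 -> X3 and a spurious association X2 - X4, which is exactly what the conditioning steps above are designed to remove.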

Interacting Lorenz systems
The next example considers three interacting identical Lorenz systems with the coupling structure X_1 → X_2 → X_3, described by a corresponding set of ordinary differential equations. We set the coupling strength to c = 0.6 and used a Runge-Kutta integrator with a step size of dt = 0.001 to numerically solve this set of equations. We then used the time series {x_k(t)}_{t=1}^{T} with k = 1, 2, 3 as the observations from the interacting Lorenz systems, to which we added noise at NL = 0.1, 0.2, 0.4. We set λ = 0.995 and performed 50 simulations for every combination of the threshold δ, time series length T and noise level NL, including the noise-free condition (i.e., NL = 0). Figures 9 and 10 show the TPR and FPR values obtained for the interacting Lorenz systems as δ and T are varied for different values of NL; here, NL = 0 represents the noise-free condition. We can see that for T = 1000, the proposed causal inference algorithm gives high TPR (≈ 1) as well as FPR (≈ 0.60) values, which start to drop for δ > 0.15 and δ > 0.1, respectively, for NL = 0. As NL is increased, the δ at which TPR and FPR start to drop also increases. However, we observe that for T = 1000, the accuracy of the algorithm as given by the F_1-score (see Fig. 11) is highest at δ ≈ 0.2 for NL = 0, while for the other values of NL, F_1 remains mostly at 0.5.
For T = 5000, the TPR remains at 1 for NL = 0, 0.10 and 0.20 and starts to drop for δ > 0.1 (see Fig. 9). Only in the case of NL = 0 does the FPR drop to ≈ 0.20 at δ ≈ 0.08 (see Fig. 10). For NL = 0.10 and 0.20, the FPR remains at 0.6 and only starts to drop when δ > 0.1. For NL = 0.40, we observe low FPR (< 0.4) as well as low TPR (< 0.5) values for all values of δ. Figure 11 shows that for T = 5000, at NL = 0.0 and δ ≈ 0.15, the F_1-score is ≈ 0.78 and drops towards 0 as δ is increased further. For the other values of NL, the F_1-score remains mostly at about 0.5 and drops to 0 as δ is increased. From Fig. 9, for T = 10000 and 20000, we observe that the TPR remains at 1, and the δ value at which it starts to drop and eventually reaches zero decreases as NL increases. However, we find TPR > 0.8 even with the addition of noise at NL = 0.1 and 0.2 for δ < 0.05, for which we also observe that the FPR starts to drop (see Fig. 10). This is also reflected in the accuracy as given by the F_1-score shown in Fig. 11, where F_1 ≈ 0.88 for NL = 0 and δ = 0.1, and F_1 ≈ 0.88 for NL = 0.1 and δ ≈ 0.05. As NL is increased to 0.20, the F_1-score drops to 0.6. At NL = 0.40, we observe that the accuracy is 0 for all values of δ, except for very small values (< 0.03), for which F_1 ≈ 0.4.
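A unidirectionally coupled Lorenz chain can be sketched as below; since the coupled equations are not reproduced here, the diffusive coupling on the x-variable, the standard Lorenz parameters, and the simple Euler step (in place of the Runge-Kutta integrator used in the text) are all assumptions of this sketch:

```python
import numpy as np

def lorenz_chain(T=20000, dt=0.001, c=0.6, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Euler integration of three Lorenz systems coupled in a chain
    X1 -> X2 -> X3 via their x-variables; returns the x series (T, 3)."""
    rng = np.random.default_rng(1)
    s = rng.normal(scale=0.1, size=(3, 3))  # rows: systems, cols: (x, y, z)
    xs = np.empty((T, 3))
    for t in range(T):
        x, y, z = s[:, 0], s[:, 1], s[:, 2]
        drive = np.array([0.0, x[0] - x[1], x[1] - x[2]])  # chain coupling
        dx = sigma * (y - x) + c * drive
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        s = s + dt * np.column_stack((dx, dy, dz))
        xs[t] = s[:, 0]
    return xs
```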

Network of neural mass models
The simulations described in Sects. 3.1 and 3.2 have been restricted to linear stochastic systems and paradigmatic nonlinear dynamical systems, both of which may not fully capture the typical nonlinear characteristics of neural time series. In order to also demonstrate the ability of the proposed method to capture interactions in nonlinear dynamical systems such as neuronal networks, we finally consider a network of neural mass models [41]. To this end, we created a network of eight neural mass models (the ordinary differential equations describing each neural mass model are provided in the Appendix, and the parameters are set as given in [41]), with the percentage of directed interactions (K) between the eight regions varying as 5%, 10% and 25% of the overall possible connections (N^2, including N self-connections), at a delay of 40 milliseconds. We vary the threshold λ from 0.99 to 1.0 and the threshold δ from 0 to 0.2. To also investigate the effect of noise on the performance of the method, Gaussian noise at an NL of 0.5 and 1.0 was added to the output of the neural mass models, in addition to the noise-free observations. For each combination of these parameters (K, λ, δ, and NL), we generated 25 simulations. The embedding parameters were the same as in Sect. 3.1 (M = 3 and d = 1). We computed the CE based on M-OPTNs for delays ranging from 10 to 100 milliseconds. Since the interaction in the simulated network occurs at around 40 milliseconds, any interaction in the estimated networks at a delay other than 40 milliseconds is counted as a false positive. Figure 12 shows the TPR values for the network of neural mass models for varying NL, λ, δ and K. We can see that for all values of K, at NL = 0 and 0.5 the value of δ up to which TPR remains at 1 decreases with increasing NL, while the TPR values are largely unaffected by the choice of λ.
At NL = 1.0, we observe that as λ is increased, the range of δ for which TPR = 1 increases, and at higher connection densities (K = 0.25), high TPR values are only observed for a narrow range of λ (> 0.995) and δ values. Figure 13 shows the associated FPR values. We observe that for δ < 0.03, the FPR remains very high and close to 1 irrespective of the choice of λ or K when NL = 0 or 0.5. When NL = 1.0, high values of FPR are observed for λ > 0.998. When δ > 0.05 and NL = 0 or 0.5, the FPR values drop to zero irrespective of the choice of λ and for all K. When NL is increased to 1.0, we observe in general lower values of FPR as λ is decreased; however, for δ > 0.05, the FPR drops to zero irrespective of the choice of λ or K.
The accuracy of the proposed approach for the network of neural mass models is finally shown in terms of the F1-score in Fig. 14. As is evident from the figure, as NL and K increase, the range of λ and δ for which we obtain high F1 values gets narrower. For example, at NL = 0 and K = 0.1, we see that F1 ≈ 1 for 0.1 < δ < 0.2, irrespective of the choice of λ.
But for the same K, we observe that in order to obtain good accuracy, i.e., a high F1-score, we need to choose lower values of λ and δ as NL is increased. When K = 0.25 and NL ≤ 0.5, we observe that F1 ≈ 0.7 for certain choices of λ and δ, but when NL = 1.0, F1 < 0.5.
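The delay-resolved scoring used above (a link counts as correct only when detected at the true delay; a detection at any other delay is a false positive) can be sketched as follows. The function name `delay_resolved_scores` and the toy adjacency arrays are illustrative, not taken from the paper:

```python
import numpy as np

def delay_resolved_scores(true_net, est_net):
    """Compare boolean (source, target, delay) adjacency arrays.

    A link is a true positive only if it is detected at the correct
    delay; a detection at any other delay counts as a false positive.
    """
    tp = np.logical_and(est_net, true_net).sum()
    fp = np.logical_and(est_net, ~true_net).sum()
    fn = np.logical_and(~est_net, true_net).sum()
    tn = np.logical_and(~est_net, ~true_net).sum()
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return tpr, fpr, f1

# Toy example: 3 nodes, delay bins 10..100 ms in 10 ms steps.
true_net = np.zeros((3, 3, 10), dtype=bool)
true_net[0, 1, 3] = True            # 0 -> 1 at 40 ms (bin index 3)
est_net = true_net.copy()
est_net[2, 0, 5] = True             # one spurious link at 60 ms
tpr, fpr, f1 = delay_resolved_scores(true_net, est_net)
```

With one true link recovered and one spurious detection, TPR is 1 while the F1-score is penalized by the false positive.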

Causality detection in MEA electrophysiology data
To validate our approach against real-world experimental data, we have applied the developed algorithm to electrophysiological recordings of epileptiform patterns generated by 4-aminopyridine (4AP)-treated rodent hippocampus-cortex (CTX) slices. In order to visualize the network activity, we transform the CE values H(i|j) obtained from each pair of MEA signals i and j into pairwise interaction strengths S(i, j) and normalize them such that the strongest pairwise interaction takes a value of 1, i.e., S = S(i, j)/max(S).
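The normalization step described above, S = S(i, j)/max(S), can be sketched as follows (the function name and the assumption that self-interactions are excluded from the maximum are illustrative choices):

```python
import numpy as np

def normalize_interactions(s):
    """Scale a pairwise interaction matrix S(i, j) so that the
    strongest pairwise interaction equals 1, i.e. S <- S / max(S)."""
    s = np.asarray(s, dtype=float).copy()
    np.fill_diagonal(s, 0.0)   # assumption: self-interactions ignored
    m = s.max()
    return s / m if m > 0 else s

S = normalize_interactions([[0.0, 0.2, 0.5],
                            [0.1, 0.0, 0.3],
                            [0.4, 0.25, 0.0]])
```

After normalization, the strongest interaction (here 0.5) maps to 1 and all other entries are scaled proportionally.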

MEA recording and signal pre-processing
Individual brain slices were placed on a 6 × 10 planar MEA (TiN electrodes, diameter 30 μm, inter-electrode distance 500 μm, impedance < 100 kΩ), held in place by a custom-made anchor, and continuously perfused at ≈ 1 ml/min with 4AP-ACSF at ≈ 32 °C, equilibrated with carbogen gas mixture. To allow for laminar flow and a high exchange rate of the 4AP-ACSF, a custom-made low-volume (≈ 500 μl) recording chamber (Crisel Instruments, Italy) replaced the default MEA ring [42]. Extracellular field potentials were acquired at 5 kHz (pre-sampling low-pass filter at 2 kHz) using the MEA2100-mini-HS60 system through the Multichannel Experimenter software (all from Multichannel Systems). For the implementation of the M-OPTN, signals were pre-processed by low-pass filtering (1 kHz) and resampling at 3 kHz. The signals were then further divided into overlapping (50%) 4 s windows to obtain time-varying measures of causality.
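The pre-processing pipeline (1 kHz low-pass, resampling from 5 kHz to 3 kHz, 50% overlapping 4 s windows) might be implemented along these lines. The filter design (4th-order Butterworth, zero-phase) is an assumption, as the paper does not specify it:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess(signal, fs_in=5000, fs_out=3000, win_s=4.0, overlap=0.5):
    """Low-pass filter (1 kHz), resample 5 kHz -> 3 kHz, and cut into
    50% overlapping 4 s windows, as described for the MEA data."""
    sos = butter(4, 1000, btype="low", fs=fs_in, output="sos")
    filtered = sosfiltfilt(sos, signal)           # zero-phase filtering
    resampled = resample_poly(filtered, fs_out, fs_in)
    size = int(win_s * fs_out)                    # 12000 samples/window
    step = int(size * (1 - overlap))              # 6000-sample hop
    windows = [resampled[i:i + size]
               for i in range(0, len(resampled) - size + 1, step)]
    return np.array(windows)

w = preprocess(np.random.randn(5000 * 60))  # one minute of raw data
```

One minute of raw data at 5 kHz yields 29 overlapping windows of 12000 samples each at the 3 kHz working rate.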

Epileptiform activity generated by 4AP-treated hippocampus-CTX slices
The brain slice preparation used in this work includes the fundamental circuits involved in the generation of limbic seizures seen in temporal lobe epilepsy and enables analyzing the network interactions leading to seizure-like discharge generation. As shown in Fig. 15, the key regions of interest (ROIs) in this brain slice preparation are the dentate gyrus (DG), the hippocampal subfields Cornu Ammonis 3 and 1 (CA3 and CA1, respectively), the subiculum (SUB) and the parahippocampal cortex (CTX-1 and CTX-2). These regions communicate through the so-called hippocampal loop [44] (see Fig. 15 (A)). When challenged with convulsant drugs, such as 4AP, hippocampus-CTX slices generate a typical epileptiform pattern made of three types of activity [45]: (1) slow interictal events, recurring at 5-20 s intervals, generated by and spreading to any ROI with no specific site of origin, (2) fast interictal events, recurring at 0.5-2 s intervals, generated specifically by the CA3, propagating to the CA1 via the Schaffer Collaterals and subsequently reaching the CTX through the SUB (output gate), (3) ictal (seizure-like) discharges, recurring at 3-5 min intervals, originating primarily in the CTX and spreading to the hippocampus proper via the DG (input filter). It has been previously demonstrated that when the hippocampal loop circuitry is intact (connected brain slice), the fast CA3-driven interictal activity controls ictal discharge generation by the CTX, such that ictal discharges disappear within 1-2 hours of 4AP application, while only the interictal patterns remain. In contrast, the disruption of the hippocampal loop upon Schaffer Collaterals damage, as seen in the hippocampal sclerosis typical of temporal lobe epilepsy (disconnected brain slice), releases the CTX from CA3 control, permitting ictal activity generation and propagation [46].
Here, we have analyzed MEA recordings of epileptiform activity generated by disconnected hippocampus-CTX brain slices, in which the Schaffer Collaterals were mechanically severed.
The circuit diagram of a disconnected hippocampus-CTX slice is depicted in Fig. 15 (a), while Fig. 15 (b) shows a disconnected hippocampus-CTX slice placed on a 6 × 10 planar MEA. Fig. 15 (c) shows the typical epileptiform pattern induced by 4AP in this brain slice preparation, consisting of brief interictal events (dots) and prolonged ictal discharges (solid line). Note that the small-amplitude events in CA1, SUB and CTX are far fields originating in CA3.
For the purpose of this study, we have selected the signals from six electrodes in each brain slice, to include each of the four hippocampal regions and two CTX locations, one proximal and one distal to the hippocampus with regard to the signal propagation pathway (see Fig. 15 (b)). For each MEA recording, we have selected a portion of the signal that includes ictal activity preceded and followed by 30-100 s of interictal activity.
The CE based on M-OPTN was computed for each window and the resulting CE value was assigned to the mid-point of each window, to obtain a time-varying measure.
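Assigning each windowed CE estimate to the mid-point of its window, as described above, can be sketched as follows (the helper name `window_midpoints` is an illustrative choice):

```python
def window_midpoints(n_samples, fs, win_s=4.0, overlap=0.5):
    """Mid-point time stamps (in seconds) of 50% overlapping windows,
    used to index the time-varying CE estimates."""
    size = int(win_s * fs)
    step = int(size * (1 - overlap))
    return [(i + size / 2) / fs
            for i in range(0, n_samples - size + 1, step)]

# 12 s of data at 3 kHz -> five 4 s windows centred at 2, 4, ..., 10 s
mids = window_midpoints(36000, 3000)
```

Plotting the per-window CE values against these mid-point times yields the time-varying causality traces shown in the off-diagonal panels of Fig. 16.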

Results
The results from the application of our method are shown in Fig. 16, where the plots on the main diagonal show the MEA signals acquired from the six selected ROIs. The off-diagonal plots represent the time-varying interaction between two ROIs across varying time delays (20 ms to 120 ms). From Fig. 16, it can be seen that significant network activity starts around the time of the ictal onset, with the strongest connections following the propagation paths CTX → DG, CTX → CA1, CTX → SUB, SUB → CTX and DG → CA3. The strongest interactions are observed at 38 and 70 ms.
The results from the entire data sets are qualitatively similar (see Figs. 17, 18 and 19 in Sect. 7 for the results obtained from the other three brain slices). The propagation from SUB → DG was also observed in three out of four slices. The interaction SUB → CTX was observed in all slices although the strength of interaction from SUB → CTX was found to be generally weaker compared to the interaction CTX → SUB. In general, the results show that outward connections from CTX and DG are generally the strongest during the ictal event and that these interactions appear to be strongest at a delay of ≈ 38 ms.

Discussion
In this work, we have proposed a new method to detect causality from multivariate observational data by computing information theoretic measures such as CE upon M-OPTNs. For reliable computation of CE and removal of non-causal neighbors, we have also proposed a pragmatic methodology to define a minimal set of conditioning variables. Our numerical experiments show that our approach can be used to reliably infer both the directionality and the delay of the interactions between the signals, even at considerably high NL. Causal network inference from real-world MEA recordings demonstrates that the proposed method can infer network interactions during ictal activity along with their associated delays.

Minimal set of neighbors for conditioning
In order to test if a node (signal) m has a causal influence on node (signal) n, the standard and most common approach is to use the Peter and Clark (PC) algorithm [47], which tests the conditional independence between two nodes given all other variables. Non-existence of a causal relationship between m and n is established once the algorithm finds that m and n are conditionally independent given other variables. When inferring causality on multivariate time series, the conditioning set may contain many variables, resulting in unreliable estimates of information theoretic measures, particularly when the sample size is small. We alleviate this problem by selecting the variables to condition on based on the common information shared between them. Furthermore, we restrict the number of variables to condition on to r ≤ 3, in accordance with the finite data size (T ≈ 10000 samples), since conditioning on a higher number of variables resulted in unreliable estimates of CE. In addition to the PC algorithm, there are other iterative, constraint-based approaches that have been proposed for conditioning, including the modified PC algorithm [47] and the fast causal inference algorithm [48], as well as score-based algorithms such as greedy equivalence search [49] for defining conditioning sets. However, most of these algorithms suffer from undesirable computational complexity and are not straightforward to implement. A systematic comparison of various conditioning approaches is beyond the scope of this work, where the motivation is to propose and demonstrate M-OPTNs as an extension of OPTNs to reliably infer causality among time series.
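The restriction to small conditioning sets (r ≤ 3) can be illustrated with a simplified greedy sketch. The function `minimal_conditioning_set` and the dictionary of precomputed shared-information scores are hypothetical; the paper's actual selection scheme is iterative and more involved:

```python
def minimal_conditioning_set(target, candidates, shared_info, r_max=3):
    """Simplified sketch: rank candidate neighbors by the (precomputed)
    information they share with the target and keep at most r_max of
    them, reflecting the r <= 3 restriction imposed by finite data."""
    ranked = sorted(candidates,
                    key=lambda c: shared_info[(c, target)],
                    reverse=True)
    return ranked[:r_max]

# Hypothetical shared-information scores between candidates and target 'x'
si = {("a", "x"): 0.9, ("b", "x"): 0.2, ("c", "x"): 0.6, ("d", "x"): 0.1}
cond = minimal_conditioning_set("x", ["a", "b", "c", "d"], si)
```

Even with four candidates, only the three most informative neighbors are retained, keeping the CE estimates tractable for T ≈ 10000 samples.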

Effect of various parameters
The numerical results obtained from simulations of coupled stochastic processes and interacting Lorenz systems have shown that the proposed approach can successfully capture coupling directions and the associated delays. When applied to more realistic simulations using a network of neural mass models, our approach could reliably recover the underlying causal coupling structure. However, as discussed in detail in the following sections, the performance of the proposed method depends on the choice of several parameters, which can be broadly categorized into (1) the number of time samples and (2) the thresholds for identifying significant connections.

Effect of varying the number of time samples
It is well known that the amount of data required for reliable reconstruction of the attractor depends on the embedding dimension M. Since the probability distributions required for the computation of CE are estimated from the ordinal patterns obtained after embedding, an inadequate data length might result in unreliable estimates of CE. In our simulations we have set M = 3 and found that the results are reliable if T ≳ 10³. For a given M, there are (M!)² possible pairs of ordinal patterns for which we have to estimate the co-occurrence frequencies. Having T ≤ 10^M samples results in many spurious interactions being classified as causal links, which is reflected as an increase in the FPR and consequently a low F1-score, as shown in our simulations. For the stochastic model system, the interacting Lorenz systems and the network of neural mass models, we used M = 3. Embedding in a higher dimension, for example M = 5, would require T > 100000 samples for reliable embedding and computation of entropy values. The typical sampling frequency of real-world data such as MEA recordings is of the order of 5000-10000 Hz, so estimating dynamic changes in a causal effect network based on 100000 samples would mean using a window size of 10 to 20 seconds, which may be far too long compared to the dynamical changes that occur in neural networks. Also, the use of 100000 or more samples increases the computation time drastically. Thus, for electrophysiological recordings from neural data, a window size of 2 to 4 seconds seems more realistic, which amounts to 10000 to 20000 samples per window, depending on the sampling frequency. This in turn means that M should not be greater than 3 or at most 4. In contrast, we observed that varying the embedding delay d did not affect the results qualitatively (not shown here), and we used d = 100 for all our simulations and experimental data.
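The ordinal embedding underlying these considerations (mapping each length-M, delay-d window to its rank-order pattern, with (M!)² co-occurrence pairs to estimate) can be sketched as follows; the function name is an illustrative choice:

```python
import math
from itertools import permutations

import numpy as np

def ordinal_patterns(x, m=3, d=1):
    """Map a time series to its sequence of ordinal patterns of
    dimension m and delay d (the rank order of each embedded vector),
    encoded as indices into the m! possible permutations."""
    index = {p: k for k, p in enumerate(permutations(range(m)))}
    out = []
    for i in range(len(x) - (m - 1) * d):
        vec = x[i:i + m * d:d]                 # delay-embedded vector
        out.append(index[tuple(np.argsort(vec))])
    return np.array(out)

# A monotonically increasing series yields only the ascending pattern.
seq = ordinal_patterns(np.array([1.0, 2.0, 3.0, 4.0]))

# (M!)^2 pattern pairs must be resolved from the data:
n_pairs = math.factorial(3) ** 2               # 36 pairs for M = 3
```

For M = 3 there are only 36 pattern pairs, which is why T ≳ 10³ samples already suffice, whereas M = 5 inflates the pair count to 14400 and pushes the data requirement beyond 10^5 samples.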

Effect of varying threshold parameters to define significant connections
The parameter λ determines the connections (direct and indirect) to be classified as significant. Lower values of λ prune away most of the connections, whereas higher values retain most of them. Based on our simulations of the network of neural mass models, we observed that when the network connectivity is less than 25% and only a moderate amount of observation noise (NL = 0.5) is present, the choice λ ≥ 0.99 seems to be the optimal setting, resulting in a high TPR and low FPR and, consequently, a high F1-score, provided that δ is chosen in the range 0.08 ≤ δ ≤ 0.12. However, when the NL increases (NL > 0.5), a setting of λ > 0.995 and 0.08 < δ < 0.1 leads to an optimal performance of our algorithm. This implies that the choice of δ, which determines the threshold to distinguish a causal neighbor from a non-causal neighbor, depends on the amount of noise present in the data. In the presence of low or moderate observational noise, a truly causal neighbor would result in a high δ, as conditioning on this neighbor should reduce the entropy significantly. In contrast, a non-causal neighbor would result in a very small δ. Our results show that in such a scenario the setting δ ≤ 0.1 is a reasonable choice, along with λ ≈ 0.995. If the data is very noisy (NL ≥ 1), then the δ necessary for identifying a truly causal neighbor would be very small, thereby making it hard to distinguish from a non-causal neighbor, for which δ should also be small. Thus, setting δ too high will prune away all the true connections along with the spurious ones, while setting δ too low might retain some spurious connections.
Another factor that impacts the choice of λ and δ, in addition to NL, is the number of connections in the network. When the network is densely connected (in the case of our simulations, more than 25%), finding an optimal δ and λ that gives a high TPR and low FPR, and consequently a high F1-score, is more challenging, as the estimated network has many spurious connections at multiple delays in addition to the interactions at the correct delays. Any choice of δ that prunes away these spurious connections will also remove true connections as the NL increases.
In summary, our results indicate that setting λ ≈ 0.995 and varying δ between 0.05 and 0.1 should result in reliable network inference, assuming the underlying networks are sparse (see Figs. 12-14).
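The bivariate significance criterion discussed here, keeping a connection when its CE falls below λ·H_max with H_max = log₂(M!), can be written compactly. The function name is an illustrative choice; the default λ = 0.995 follows the recommendation above:

```python
import math

def significant(ce, m=3, lam=0.995):
    """A bivariate connection is kept as significant when its CE value
    falls below lam * H_max, where H_max = log2(M!) is the theoretical
    maximum reached by two independent processes."""
    h_max = math.log2(math.factorial(m))   # ~2.585 bits for M = 3
    return ce < lam * h_max

# For M = 3 the cut-off is ~2.572 bits: clearly coupled pairs
# (low CE) pass, while near-independent pairs are rejected.
```

In practice this pre-selection is followed by the δ-based pruning of non-causal neighbors described above.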

Causal network inference from MEA data
As a real-world example, we have applied the proposed method to MEA electrophysiology recordings obtained from an in vitro model of acute limbic seizures. Specifically, we have used 4AP-treated rodent hippocampus-CTX slices as a simplified model of the primary neural circuits involved in temporal lobe epilepsy. This model has been extensively characterized [46] and provides a solid ground-truth to validate our approach.
In a previous study [46], the reported mean time delays for ictal discharge propagation in the directions CTX → DG, CTX → CA3 and CTX → CA1 were 37.5 ± 9.6 ms, 71.7 ± 27.5 ms and 31 ± 6.3 ms, respectively. In keeping with this, our method has detected connectivity in the directions CTX → DG, CTX → CA3 and CTX → CA1 at delays of 30-38 ms during the ictal discharge. Note that the short delay in the CTX → CA1 direction is due to the signal propagation along the direct temporoammonic pathway [46], which is known to short-circuit the hippocampal loop. Moreover, the connection between SUB and CTX is consistent with the previously reported role of SUB-CTX interactions through the temporoammonic pathway in reinforcing ictal synchronization in animal models of temporal lobe epilepsy [50].
We also found connections in the directions DG → CTX, CA3 → CTX, CA3 → SUB and DG → SUB. However, as these pathways are disrupted by the Schaffer Collaterals cut, they represent false positives due to far-field contamination [51] of the signals recorded from CA1, SUB and CTX; the far fields originating in CA3 can be seen in Fig. 15. In keeping with this, we found that the strength of these false positive connections is generally lower than that of the expected true connections. Such false positive interactions could also stem from the common driver issue (see Fig. 1 (b)), wherein one CTX ROI drives both the DG and the other CTX ROI, causing a spurious DG → CTX connection.
Overall, these results support the reliability and usefulness of our approach for analyzing interactions and their delays in real-world observations, such as electrophysiological time series.

Future work and perspectives
Although the proposed algorithm can reliably perform causal inference, there are certain issues that warrant our attention. First, we have not compared our method to other existing techniques that are commonly used to infer causality from electrophysiological recordings. In fact, the main motivation behind this work was to introduce and provide a proof-of-concept that complex network based time series analysis methods such as OPTNs can be generalized and improved to detect causality from multivariate observations. A systematic comparison of our approach with other commonly used techniques to infer causality is beyond the scope of this paper and will remain a subject of future studies.
Second, we applied our method to in vitro MEA electrophysiology data, which is not as widely used to map connections among brain regions as in vivo recordings. In the case of EEG data, we typically do not have the ground truth to validate our method against. For the MEA data used in this study, previous studies have described the anatomical and functional circuits associated with this brain slice preparation, which served as the reference for our results. Furthermore, estimating causality directly from EEG recordings is not trivial due to the issue of volume conduction. To mitigate volume conduction effects in EEG recordings, connectivity is estimated from source time series obtained after solving the EEG inverse problem [52]. Thus, the proposed causal inference method has to be applied to the inverse solution, rather than directly to the EEG data, for reliable causal inference. It is not yet clear how such a transformation would alter the structural properties of the time series, and what impact it could have on the estimation of the ordinal patterns remains a subject of future studies.
Third, we did not perform any surrogate data testing but rather relied on the theoretical maximum H_max = log₂(M!) and used λ·H_max as the threshold for identifying significant bivariate connections, where 0.99 ≤ λ < 1, since due to the finite sample size, two independent processes will not have a CE value exactly equal to H_max. We found this approach to be much faster than generating bivariate surrogates, which gave essentially similar results (not shown). In STEP VI of the proposed algorithm, we use the threshold δ to distinguish between causal and non-causal neighbors. This step could be considerably improved by performing significance testing for the conditional independence test as proposed in [1], which preserves, for example, the association between X and Y in the coupling scheme X ← Z → Y that would otherwise be destroyed in a strictly bivariate permutation scheme.

Conclusions
In this paper, we have developed a new method based on OPTNs to infer causality from multivariate observational data. The proposed method allows inferring causality at different delays and can be adapted to provide a time-varying measure of causality. We have also proposed an iterative scheme to find a minimal set of neighbors for conditioning, yielding a reliable estimation of the co-occurrence entropy as the employed coupling indicator. We have demonstrated the validity of our approach using different types of simulated signals as well as real-world electrophysiological time series.
In conclusion, the proposed approach provides a complementary tool for detecting causality from multivariate time series data and can be particularly useful in the area of neuroscience, where the estimation of (time-varying) causal networks from electrophysiology recordings has remained a fundamental problem so far.