Transfer entropy—a model-free measure of effective connectivity for the neurosciences
Abstract
Understanding causal relationships, or effective connectivity, between parts of the brain is of utmost importance because a large part of the brain’s activity is thought to be internally generated and, hence, quantifying stimulus response relationships alone does not fully describe brain dynamics. Past efforts to determine effective connectivity mostly relied on model based approaches such as Granger causality or dynamic causal modeling. Transfer entropy (TE) is an alternative measure of effective connectivity based on information theory. TE does not require a model of the interaction and is inherently non-linear. We investigated the applicability of TE as a metric in a test for effective connectivity to electrophysiological data based on simulations and magnetoencephalography (MEG) recordings in a simple motor task. In particular, we demonstrate that TE improved the detectability of effective connectivity for non-linear interactions, and for sensor level MEG signals where linear methods are hampered by signal-cross-talk due to volume conduction.
Keywords
Information theory, Effective connectivity, Causality, Information transfer, Electroencephalography, Magnetoencephalography
1 Introduction
Science is about making predictions. To this aim scientists construct a theory of causal relationships between two observations. In neuroscience, one of the observations can often be manipulated at will, e.g. a stimulus in an experiment, and the second observation is measured, e.g. neuronal activity. If we can correctly predict the behavior of the second observation we have identified a causal relationship between stimulus and response. However, identifying causal relationships between stimuli and responses covers only part of neuronal dynamics: a large part of the brain’s activity is internally generated and contributes to the response variability that is observed despite constant stimuli (Arieli et al. 1996). For internally generated dynamics it is difficult to infer physical causality, because a deliberate manipulation of this aspect of the system is extremely difficult. Nevertheless, we can try to make predictions based on the concept of causality as it was introduced by Wiener (1956). In Wiener’s definition an improvement of the prediction of the future of a time series X by the incorporation of information from the past of a second time series Y is seen as an indication of a causal interaction from Y to X. Such causal interactions across brain structures are also called ‘effective connectivity’ (Friston 1994) and they are thought to reveal the information flow associated with neuronal processing much more precisely than functional connectivity, which only reflects the statistical covariation of signals as typically revealed by cross-correlograms or coherency measures. Therefore, we must identify causal relationships between parts of the brain, be they single cells, cortical columns, or brain areas.
Various measures of causal relationships, or effective connectivity, exist. They can be divided into two large classes: those that quantify effective connectivity based on the abstract concept of information of random variables (e.g. Schreiber 2000), and those based on specific models of the processes generating the data. Methods in the latter class are most widely used to study effective connectivity in neuroscience, with Granger causality (GC, Granger 1969) and dynamic causal modeling (DCM, Friston et al. 2003) arguably being most popular. In the next two paragraphs we give a short overview of the data generation models in GC and DCM and their specific consequences so that the reader can appreciate the fundamental differences between these model based approaches and the information theoretic approach presented below.
Standard implementations of GC use a linear stochastic model for the intrinsic dynamics of the signal and a linear interaction.^{1} Therefore, GC is only well applicable when three prerequisites are met: (a) The interaction between the two units under observation has to be well approximated by a linear description, (b) the data have to have relatively low noise levels (see e.g. Nalatore et al. 2007), and (c) cross-talk between the measurements of the two signals of interest has to be low (Nolte et al. 2008). Frequency domain variants of GC such as the partial directed coherence or the directed transfer function fall in the same category (Pereda et al. 2005).
DCM assumes a bilinear state space model (BSSM). Thus, DCM covers non-linear interactions—at least partially. DCM requires knowledge about the input to the system, because this input is modeled as modulating the interactions between the parts of the system (Friston et al. 2003). DCM also requires a certain amount of a priori knowledge about the network of connectivities under investigation, because ultimately DCM compares the evidence for several competing a priori models with respect to the observed data. This a priori knowledge on the input to the system and on the potential connectivity may not always be available, e.g. in studies of the resting-state. Therefore, DCM may not be optimal for exploratory analyses.
A measure that complements these model based approaches should meet four requirements:
1. It should not require the a priori definition of the type of interaction, so that it is useful as a tool for exploratory investigations.
2. It should be able to detect frequently observed types of purely non-linear interactions. This is because strong non-linearities are observed across all levels of brain function, from the all-or-none mechanism of action potential generation in neurons to non-linear psychometric functions, such as the power-law relationship in Weber’s law or the inverted-U relationship between arousal levels and response speeds described in the Yerkes-Dodson law (Yerkes and Dodson 1908).
3. It should detect effective connectivity even if there is a wide distribution of interaction delays between the two signals, because signaling between brain areas may involve multiple pathways or transmission over various axons that connect two areas and that vary in their conduction delays (Swadlow and Waxman 1975; Swadlow et al. 1978).
4. It should be robust against linear cross-talk between signals. This is important for the analysis of data recorded with electro- or magnetoencephalography, which provide a large part of the available electrophysiological data today.
The fact that a potential new method should be as model free as possible naturally leads to the application of information theoretic techniques. Information theory (IT) sets a powerful framework for the quantification of information and communication (Shannon 1948). It is not surprising then that information theory also provides an ideal basis to precisely formulate causal hypotheses. In the next paragraph, we present the connection between the quantification of information and communication and Wiener’s definition of causal interactions (Wiener 1956) in more detail because of its importance for the justification of using IT methods in this work.
In the context of information theory, the key measure of information of a discrete^{2} random variable is its Shannon entropy (Shannon 1948; Reza 1994). This entropy quantifies the reduction of uncertainty obtained when one actually measures the value of the variable. On the other hand, Wiener’s definition of causal dependencies rests on an increase of predictive power. In particular, a signal X is said to cause a signal Y when the future of signal Y is better predicted by adding knowledge from the past and present of signal X than by using the present and past of Y alone (Wiener 1956). Therefore, if prediction enhancement can be associated with uncertainty reduction, a causality measure can be expected to be naturally expressible in terms of information theoretic concepts.
First attempts to obtain model-free measures of the relationship between two random variables were based on mutual information (MI). MI quantifies the amount of information that can be obtained about a random variable by observing another. MI is based on probability distributions and is sensitive to second and all higher order correlations. Therefore, it does not rely on any specific model of the data. However, MI says little about causal relationships, because of its lack of directional and dynamical information: First, MI is symmetric under the exchange of signals. Thus, it cannot distinguish driver and response systems. And second, standard MI captures the amount of information that is shared by two signals. In contrast, a causal dependence is related to the information being exchanged rather than shared (for instance, due to a common drive of both signals by an external, third source). To obtain an asymmetric measure, delayed mutual information, i.e. MI between one of the signals and a lagged version of another has been proposed. Delayed MI results in an asymmetric measure and contains certain dynamical structure due to the time lag incorporated. Nevertheless, delayed mutual information has been pointed out to contain certain flaws such as problems due to a common history or shared information from a common input (Schreiber 2000).
Transfer entropy naturally incorporates directional and dynamical information, because it is inherently asymmetric and based on transition probabilities. Interestingly, Paluš has shown that transfer entropy can be rewritten as a conditional mutual information (Paluš 2001; Hlavackova-Schindler et al. 2007).
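For orientation, the standard textbook form of transfer entropy and its rewriting as a conditional mutual information can be sketched as follows. The notation is an assumption made here for illustration (\(\mathbf{x}_t\) and \(\mathbf{y}_t\) denote the past states of X and Y, and a one-step prediction is used); it is not copied from the numbered equations of this paper.

$$ TE(X \rightarrow Y) = \sum_{y_{t+1},\, \mathbf{y}_t,\, \mathbf{x}_t} p(y_{t+1}, \mathbf{y}_t, \mathbf{x}_t) \, \log \frac{p(y_{t+1} \mid \mathbf{y}_t, \mathbf{x}_t)}{p(y_{t+1} \mid \mathbf{y}_t)} = I(Y_{t+1}; \mathbf{X}_t \mid \mathbf{Y}_t) $$

The ratio inside the logarithm compares two transition probabilities, one with and one without knowledge of the past of X, which is where the asymmetry and the link to Wiener's definition come from. If the past of X adds nothing to the prediction of Y, the ratio is 1 and the measure vanishes.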
The main convenience of such an information theoretic functional designed to detect causality is that, in principle, it does not assume any particular model for the interaction between the two systems of interest, as requested above. Thus, the sensitivity of transfer entropy to all order correlations becomes an advantage for exploratory analyses over GC or other model based approaches. This is particularly relevant when the detection of some unknown non-linear interactions is required.
Here, we demonstrate that transfer entropy does indeed fulfill the above requirements 1–4 and is therefore a useful addition to the available methods for the quantification of effective connectivity, when used as a metric in a suitable permutation test for independence. We demonstrate its ability to detect purely non-linear interactions, its ability to deal with a range of interaction delays, and its robustness against linear cross-talk on simulated data. This latter point is of particular interest for non-invasive human electrophysiology using EEG or MEG. The robustness of TE against linear cross-talk in the presence of noise has, to our knowledge, not been investigated before. We test transfer entropy on a variety of simulated signals with different signal generation dynamics, including biologically plausible signals with spectra close to 1/f. We also investigate a range of linear and purely non-linear coupling mechanisms. In addition, we demonstrate that transfer entropy works without specifying a signal model, i.e. that requirement 1 is fulfilled. We extend earlier work (Hinrichs et al. 2008; Chávez et al. 2003; Gourévitch and Eggermont 2007) by explicitly demonstrating the applicability of transfer entropy for the case of linearly mixed signals.
2 Methods
The method section is organized in four main parts. In the first part we describe how to compute TE numerically. As several estimation techniques could be applied for this purpose we quickly review these possibilities and give the rationale for our particular choice of estimator. In the second part, we describe two particular problems that arise in neuroscience applications—delayed interactions, and observation of the signals of interest by measurements that only represent linear mixtures of these signals. The third part provides details on the simulation of test cases for the detection of effective connectivity via TE. The last part contains details of the MEG recordings in a self-paced finger-lifting task that we chose as a proof-of-concept for the analysis of neuroscience data.
2.1 Computation of transfer entropy
In the next subsection we detail both how to obtain a data-efficient estimate of Eq. 3 from the raw signals and a statistical significance analysis based on surrogate data.
2.1.1 Reconstructing the state space
Experimental recordings can only access a limited number of variables which are more or less related to the full state of the system of interest. However, sensible causality hypotheses are formulated in terms of the underlying systems rather than on the signals being actually measured. To partially overcome this problem several techniques are available to approximately reconstruct the full state space of a dynamical system from a single series of observations (Kantz and Schreiber 1997).
This procedure depends on two parameters, the dimension d and the delay τ of the embedding. While there is an extensive literature on how to choose these parameters, the different methods proposed are far from reaching any consensus (Kantz and Schreiber 1997). A popular option is to take the embedding delay τ as the auto-correlation decay time (\(\mathit{act}\)) of the signal or the first minimum (if any) of the auto-information. To determine the embedding dimension, the Cao criterion offers an algorithm based on false neighbors computation (Cao 1997). However, alternatives for non-deterministic time-series are available (Ragwitz and Kantz 2002).
The parameters d and τ considerably affect the outcome of the TE estimates. For instance, a low value of d can be insufficient to unfold the state space of a system and consequently degrade the meaning of any TE measure, as will be demonstrated below. On the other hand, a too large dimensionality makes the estimators less accurate for a given data length and significantly enlarges the computing time. Consequently, while we have used the recipes described above to orient our search for good embedding parameters, we have systematically scanned d and τ to optimize the performance of TE measures.
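The role of d and τ can be illustrated with a minimal Python sketch of a delay embedding. The function names are hypothetical, and the 1/e threshold used to define the autocorrelation decay time is an assumption; the text above does not specify which threshold was used.

```python
import math


def autocorrelation_decay_time(signal, threshold=math.exp(-1)):
    """First lag at which the autocorrelation falls below `threshold`.

    The 1/e default is an assumption made for this sketch.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    var = sum(c * c for c in centered)
    if var == 0:
        raise ValueError("constant signal has no autocorrelation decay time")
    for lag in range(1, n):
        acf = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / var
        if acf < threshold:
            return lag
    return n - 1


def delay_embed(signal, d, tau):
    """Delay embedding with dimension d and delay tau (in samples).

    Returns (times, states) where states[j] is the tuple
    (x(t), x(t - tau), ..., x(t - (d - 1) * tau)) for t = times[j].
    """
    start = (d - 1) * tau  # first time index with a complete history
    times = list(range(start, len(signal)))
    states = [tuple(signal[t - m * tau] for m in range(d)) for t in times]
    return times, states
```

Each embedded state uses (d − 1)·τ samples of history, so large values of d or τ shorten the usable part of every trial. This is one reason why the choice of embedding parameters interacts with data length.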
2.1.2 Estimating the transfer entropy
Thus, the problem amounts to computing the different joint and marginal probability distributions implicated in Eq. (5). In principle, there are many ways to estimate such probabilities and their performance strongly depends on the characteristics of the data to be analyzed. See Hlavackova-Schindler et al. (2007) for a detailed review of techniques. For discrete processes, the probabilities involved can be easily determined from the frequencies of visitation of different states. For continuous processes, the case of main interest in this study, a reliable estimation of the probability densities is much more delicate since a continuous density has to be approximated from a finite number of samples. Moreover, the solution of coarse-graining a continuous signal into discrete states is hard to interpret unless the measure converges when reducing the coarsening scale. In the following, we give the rationale for our choice of estimator and describe how it works.
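For the discrete case mentioned above, a plug-in estimate based on visitation frequencies can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (one-sample histories, no bias correction) and not the estimator used in this study; the function name is hypothetical.

```python
import math
import random
from collections import Counter


def plugin_transfer_entropy(source, target, u=1):
    """Plug-in TE (in bits) from `source` to `target` for discrete sequences.

    Uses one-sample histories and prediction time u (in samples).
    """
    triples = [(target[t + u], target[t], source[t])
               for t in range(len(target) - u)]
    n = len(triples)
    c_full = Counter(triples)                               # (y_future, y_past, x_past)
    c_pasts = Counter((yp, xp) for _, yp, xp in triples)    # (y_past, x_past)
    c_fut_y = Counter((yf, yp) for yf, yp, _ in triples)    # (y_future, y_past)
    c_y = Counter(yp for _, yp, _ in triples)               # (y_past,)
    te = 0.0
    for (yf, yp, xp), c in c_full.items():
        p_with_source = c / c_pasts[(yp, xp)]        # p(y_future | y_past, x_past)
        p_without = c_fut_y[(yf, yp)] / c_y[yp]      # p(y_future | y_past)
        te += (c / n) * math.log2(p_with_source / p_without)
    return te


# Toy data: y is a copy of x delayed by one sample, x is a fair coin.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(5000)]
y = [0] + x[:-1]
te_xy = plugin_transfer_entropy(x, y)  # close to 1 bit
te_yx = plugin_transfer_entropy(y, x)  # close to 0 bits
```

In this toy example the future of y is fully determined by the present of x, so the estimate in the direction X → Y approaches one bit, while the reverse direction stays near zero apart from a small positive finite-sample bias.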
A possible strategy for the design of an estimator relies on finding the parameters that best fit the sample probability densities into some known distribution. While computationally straightforward, such an approach amounts to assuming a certain model for the probability distribution, which without further constraints is difficult to justify. Among the nonparametric approaches, fixed and adaptive histogram or partition methods are very popular and widely used. However, other nonparametric techniques such as kernel or nearest-neighbor estimators have been shown to be more data efficient and accurate while avoiding certain arbitrariness stemming from binning (Victor 2002; Kaiser and Schreiber 2002). In this work we shall use an estimator of the nearest-neighbor class.
Nearest-neighbor techniques estimate smooth probability densities from the distribution of distances of each sample point to its k-th nearest neighbor. Consequently, this procedure results in an adaptive resolution since the distance scale used changes according to the underlying density. Kozachenko-Leonenko (KL) is an example of such a class of estimators and a standard algorithm to compute Shannon entropy (Kozachenko and Leonenko 1987). Nevertheless, a naive approach of estimating TE via computing each term of Eq. 5 from a KL estimator is inadequate. To see why, it is important to notice that the probability densities involved in computing TE or MI can be of very different dimensionality (from 1 + d_{x} up to 1 + d_{x} + d_{y} for the case of TE). For a fixed k, this means that different distance scales are effectively used for spaces of different dimension. Consequently, the biases of each Shannon entropy arising from the non-uniformity of the distribution will depend on the dimensionality of the space, and therefore, will not cancel each other.
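The dimensionality argument can be made explicit by writing transfer entropy as a sum of four Shannon entropies. The form below follows from the identity between transfer entropy and conditional mutual information; the notation (\(\mathbf{x}_t\), \(\mathbf{y}_t\) for the embedded states, \(u\) for the prediction time, \(S\) for Shannon entropy) is an assumption of this sketch, and it is offered as the standard decomposition rather than as a verbatim copy of the numbered equation referred to above.

$$ TE(X \rightarrow Y) = S(\mathbf{y}_t, \mathbf{x}_t) - S(y_{t+u}, \mathbf{y}_t, \mathbf{x}_t) + S(y_{t+u}, \mathbf{y}_t) - S(\mathbf{y}_t) $$

The four entropies are defined on spaces of different dimension. If each is estimated separately with the same k, the neighborhood sizes differ from term to term, so the individual biases have no reason to cancel in the sum.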
To overcome such problems in mutual information estimates, Kraskov, Stögbauer, and Grassberger have proposed a new approach (Kraskov et al. 2004). The key idea is to use a fixed mass (k) only in the higher dimensional space and project the distance scale set by this mass into the lower dimensional spaces. Thus, the procedure designed for mutual information suggests to first determine the distances to k-th nearest neighbors in the joint space. Then, an estimator of MI can be obtained by counting the number of neighbors that fall within such distances for each point in the marginal space. The estimator of MI based on this method displays many good statistical properties: it greatly reduces the bias obtained with individual KL estimates, and it seems to become an exact estimator in the case of independent variables. For these reasons, in this work we have followed a similar scheme to provide a data-efficient sample estimate for transfer entropy (Gomez-Herrero et al. 2010). Thus, we have obtained an estimator that permits us, at least partially, to tackle some of the main difficulties faced in neuronal data sets mentioned in the beginning of the Methods section. In summary, since the estimator is more data efficient and accurate than other techniques (especially those based on binning), it allows the analysis of shorter data sets possibly contaminated by small levels of noise. At the same time, the method is especially geared to handle the biases of high dimensional spaces naturally occurring after the embedding of raw signals.
As to computing time, this class of methods spends most of its resources on finding neighbors. It is therefore highly advisable to implement an efficient search algorithm that is optimal for the length and dimensionality of the data to be analyzed (Cormen et al. 2001). For the current investigation, the algorithm was implemented with the help of OpenTSTool (Version 1.2 on Linux 64 bit; Merkwirth et al. 2009). The full set of methods applied here is available as an open source MATLAB toolbox (Lindner et al. 2009).
In practice, it is important to consider that this nearest-neighbor estimation method carries two parameters. One is the mass of the nearest-neighbor search (k), which controls the level of bias and statistical error of the estimate. For the remainder of this manuscript this parameter was set to k = 4, as suggested in Kraskov et al. (2004), unless stated otherwise. The second parameter refers to the Theiler correction, which aims to exclude autocorrelation effects from the density estimation. It consists of discarding from the nearest-neighbor search those samples that are closer in time to a reference point than a given lapse (T). Here, we chose \(T = 1\ \mathit{act}\), unless stated otherwise. In general, this means that even though TE does not assume any particular model, its numerical estimation relies on at least five different parameters: the embedding delay (τ) and dimension (d), the mass of the nearest-neighbor search (k), the Theiler correction window (T), and the prediction time (u). The latter accounts for non-instantaneous interactions. Specifically, it reflects that in that case an increase in the predictability of one signal due to the incorporation of the past of another should only occur for a certain latency or prediction time. Since axonal conduction delays among remote areas can amount to tens of milliseconds (Swadlow and Waxman 1975; Swadlow 1994), incorporating the prediction time is important for a sensible causality analysis of neuronal data sets, as we shall see below.
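How the five parameters enter a nearest-neighbor estimate can be sketched in Python as follows. This is a hedged illustration, not the toolbox code: it assumes a commonly used Kraskov-style estimator of conditional mutual information (fixed k in the joint space, neighbor counts in the projected spaces, maximum norm), uses a brute-force neighbor search instead of an optimized one, and all function names are hypothetical.

```python
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function at a positive integer: psi(n) = -gamma + sum_{j<n} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def max_dist(a, b):
    """Maximum-norm distance between two points given as tuples."""
    return max(abs(p - q) for p, q in zip(a, b))


def te_nearest_neighbor(source, target, d=1, tau=1, u=1, k=4, theiler=0):
    """Nearest-neighbor TE estimate (in nats) from `source` to `target`.

    d, tau: embedding dimension and delay; u: prediction time;
    k: number of neighbors in the joint space; theiler: samples with
    |i - j| <= theiler are excluded from the neighbor search.
    """
    start = (d - 1) * tau
    times = range(start, len(target) - u)
    future = [(target[t + u],) for t in times]
    y_past = [tuple(target[t - m * tau] for m in range(d)) for t in times]
    x_past = [tuple(source[t - m * tau] for m in range(d)) for t in times]
    n = len(future)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if abs(i - j) > theiler]
        d_y = [max_dist(y_past[i], y_past[j]) for j in others]
        d_fy = [max(max_dist(future[i], future[j]), dy)
                for j, dy in zip(others, d_y)]
        d_yx = [max(max_dist(x_past[i], x_past[j]), dy)
                for j, dy in zip(others, d_y)]
        d_joint = [max(a, b) for a, b in zip(d_fy, d_yx)]
        eps = sorted(d_joint)[k - 1]  # distance to the k-th joint-space neighbor
        n_y = sum(1 for v in d_y if v < eps)    # neighbors in (y_past)
        n_fy = sum(1 for v in d_fy if v < eps)  # neighbors in (future, y_past)
        n_yx = sum(1 for v in d_yx if v < eps)  # neighbors in (y_past, x_past)
        total += (digamma_int(n_y + 1) - digamma_int(n_fy + 1)
                  - digamma_int(n_yx + 1))
    return digamma_int(k) + total / n


# Toy data with a purely quadratic coupling X -> Y at a delay of one sample.
rng = random.Random(1)
n_samples = 400
x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
y = [0.0]
for t in range(1, n_samples):
    y.append(0.4 * y[t - 1] + 0.8 * x[t - 1] ** 2 + 0.1 * rng.gauss(0.0, 1.0))

te_xy = te_nearest_neighbor(x, y)  # coupled direction
te_yx = te_nearest_neighbor(y, x)  # uncoupled direction
```

The brute-force search scales quadratically with the number of samples, which is acceptable for a toy example but not for real recordings; this is the step that an efficient neighbor search replaces. The raw values are not tested against zero here, since significance is assessed with surrogate data.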
2.1.3 Significance analysis
To test the statistical significance of an obtained TE value we used surrogate data. In general, generating surrogate data with the same statistical properties as the original data but selectively destroying any causal interaction is difficult. However, when the data set has a trial structure, shuffling trials generates suitable surrogate data sets for the null hypothesis of no causal interaction, provided that stationarity and trial independence are assured. On these data we then used a permutation test (~19,000 permutations) on the unshuffled and shuffled trials to obtain a p-value. P-values below 0.05 were considered significant. Where necessary, a correction of this threshold for multiple comparisons was applied using the false discovery rate (FDR, q < 0.05; Genovese et al. 2002).
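The logic of the surrogate test can be sketched in Python as follows. The details are assumptions made for illustration: trials are re-paired by a cyclic shift so that no trial keeps its original partner, the test statistic is the difference of mean per-trial TE values, and the test is one-sided. The function names are hypothetical and the default number of permutations is kept small.

```python
import random


def shuffle_trials(source_trials, rng):
    """Re-pair source trials with target trials from a different trial index.

    A cyclic shift by a random non-zero offset guarantees that no trial
    keeps its original partner (an assumption of this sketch).
    """
    n = len(source_trials)
    shift = rng.randrange(1, n)
    return [source_trials[(i + shift) % n] for i in range(n)]


def permutation_p_value(original, surrogate, n_perm=2000, rng=None):
    """One-sided p-value for mean(original) > mean(surrogate).

    `original` and `surrogate` hold one TE value per trial for the
    unshuffled and the shuffled pairing, respectively.
    """
    rng = rng or random.Random(0)
    n = len(original)
    m = len(surrogate)
    observed = sum(original) / n - sum(surrogate) / m
    pooled = list(original) + list(surrogate)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / m
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

A typical use would compute TE once per trial for the original pairing and once for the pairing returned by `shuffle_trials`, then pass both lists of values to `permutation_p_value`.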
2.2 Particular problems in neuroscience data: instantaneous mixing and delayed interactions
Neuroscience data have specific characteristics that challenge a simple analysis of effective connectivity. First, the interaction may involve large time delays of unknown duration and, second, the data generated by the original processes may not be available but only measurements that represent linear mixtures of the original data—as is the case in EEG and MEG. In this section we describe a number of additional tests that may help to interpret the results obtained by computing TE values from these types of neuroscience data.
Tests for instantaneous linear mixing and for multiple noisy observations of a single source
Delayed interactions, Wiener’s definition of causality, and choice of embedding parameters
2.3 Simulated data
We used simulated data to test the ability of TE to uncover causal relations under different situations relevant to neuroscience applications. In particular, we always considered two interacting systems and simulated different internal dynamics (autoregressive and 1/f characteristics), effective connectivity (linear, threshold and quadratic coupling), and interaction delays (single delay and a distribution of delays). In addition, we simulated linear instantaneous mixing processes during measurement, because of their relevance for EEG and MEG.
2.3.1 Internal signal dynamics
2.3.2 Types of interaction
where the sums are extended over a certain domain of positive integer values. In the results section we consider the case in which δ′ takes values on a uniform distribution of width 6 centered around a given delay.
The coupling constants γ_{lin}, γ_{quad}, γ_{thresh} were always chosen such that the variance of the interaction term was comparable to the variance of y(t) that would be obtained in the absence of any coupling.
2.3.3 Linear mixing
- (A) The first test case consisted in unidirectionally coupled signal pairs X →Y generated from coupled AR(10) processes as described above and then transformed into two linear instantaneous mixtures X_{ε},Y_{ε} in the following way:
$$ X_{\epsilon}(t)=(1-\epsilon)X(t)+\epsilon Y(t) \tag{15} $$
$$ Y_{\epsilon}(t)=\epsilon X(t)+(1-\epsilon)Y(t) \tag{16} $$
Here, ε is a parameter that describes the amount of linear mixing or ‘signal cross-talk’. A value of ε of 0.5 means that the mixing leads to two identical signals and, hence, no significant TE should be observed. We then investigated for three different values of ε = (0.1,0.25,0.4) how well TE detects the underlying effective connectivity from X to Y if only the linear mixtures X_{ε},Y_{ε} are available.
- (B) The second test case consisted in generating measurement signals X_{ε},Y_{ε} in the following way:
$$ X_{\epsilon}(t)=s(t) \tag{17} $$
$$ Y_{\epsilon}(t)=(1-\epsilon) s(t) + \epsilon \eta_{Y} \tag{18} $$
Here, s(t) is the common source, a mean-free AR(10) process with unit variance. s(t) is measured twice: once noise-free in X_{ε} and once dampened by a factor (1 − ε) and corrupted by independent Gaussian noise of unit variance, η_{Y}, in Y_{ε}. Here, we tested the ability of our implementation of TE to reject the hypothesis of effective connectivity. This second test case is of particular importance for the application of TE to EEG and MEG measurements, where a single source is often observed on two sensors that have different noise characteristics, e.g. due to differences in contact resistance of the EEG electrodes or the characteristics of the MEG SQUIDs.
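The two mixing schemes of Eqs. 15 to 18 translate directly into code. The following Python sketch uses hypothetical function names and plain lists; the generation of the AR(10) signals themselves is not shown.

```python
import random


def mix_pair(x, y, eps):
    """Test case (A), Eqs. 15-16: symmetric instantaneous mixing of X and Y."""
    x_mix = [(1 - eps) * a + eps * b for a, b in zip(x, y)]
    y_mix = [eps * a + (1 - eps) * b for a, b in zip(x, y)]
    return x_mix, y_mix


def observe_common_source(s, eps, rng):
    """Test case (B), Eqs. 17-18: two observations of one source s.

    X is the noise-free source; Y is the source dampened by (1 - eps)
    plus unit-variance Gaussian noise weighted by eps.
    """
    x_obs = list(s)
    y_obs = [(1 - eps) * v + eps * rng.gauss(0.0, 1.0) for v in s]
    return x_obs, y_obs
```

For ε = 0.5 `mix_pair` returns two identical signals, which matches the statement above that no significant TE should be observed in that limit.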
2.3.4 Choice of embedding parameters for delayed interactions
Detection of true and false effective connectivity for a fixed embedding dimension d of 7, and an embedding delay τ of 1 autocorrelation time
Dynamics | δ | Coupling | u | X →Y (true) | Y →X (false) |
---|---|---|---|---|---|
AR(10) | 5 | Lin | 6 | 1 | 1 |
AR(10) | 5 | Lin | 21 | 1 | 0 |
AR(10) | 5 | Lin | 101 | 0 | 0 |
AR(10) | 5 | Threshold | 6 | 1 | 1 |
AR(10) | 5 | Threshold | 21 | 1 | 0 |
AR(10) | 5 | Threshold | 101 | 0 | 0 |
AR(10) | 5 | Quadratic | 6 | 1 | 1 |
AR(10) | 5 | Quadratic | 21 | 1 | 0 |
AR(10) | 5 | Quadratic | 101 | 0 | 0 |
AR(10) | 20 | Lin | 6 | 1 | 1 |
AR(10) | 20 | Lin | 21 | 1 | 0 |
AR(10) | 20 | Lin | 101 | 1 | 0 |
AR(10) | 20 | Threshold | 6 | 0 | 0 |
AR(10) | 20 | Threshold | 21 | 1 | 0 |
AR(10) | 20 | Threshold | 101 | 0 | 0 |
AR(10) | 20 | Quadratic | 6 | 0 | 0 |
AR(10) | 20 | Quadratic | 21 | 1 | 0 |
AR(10) | 20 | Quadratic | 101 | 0 | 0 |
AR(10) | 100 | Lin | 6 | 1 | 0 |
AR(10) | 100 | Lin | 21 | 1 | 0 |
AR(10) | 100 | Lin | 101 | 1 | 0 |
AR(10) | 100 | Threshold | 6 | 0 | 0 |
AR(10) | 100 | Threshold | 21 | 0 | 0 |
AR(10) | 100 | Threshold | 101 | 1 | 0 |
AR(10) | 100 | Quadratic | 6 | 1 | 0 |
AR(10) | 100 | Quadratic | 21 | 1 | 0 |
AR(10) | 100 | Quadratic | 101 | 1 | 0 |
Detection of true and false effective connectivity as a function of the embedding delay τ, the embedding dimension d, and the prediction time u for data with unidirectional coupling X →Y via a quadratic function, 1/f dynamics and an interaction delay δ of 100 samples
Dynamics | δ | d | u | τ [ACT] | X →Y | Y →X |
---|---|---|---|---|---|---|
1/f | 100 | 4 | 21 | 1 | 0 | 0 |
1/f | 100 | 4 | 101 | 1 | 1 | 1 |
1/f | 100 | 7 | 21 | 1 | 0 | 1 |
1/f | 100 | 7 | 101 | 1 | 1 | 0 |
1/f | 100 | 10 | 21 | 1 | 0 | 0 |
1/f | 100 | 10 | 101 | 1 | 1 | 0 |
1/f | 100 | 4 | 21 | 1.5 | 0 | 0 |
1/f | 100 | 4 | 101 | 1.5 | 1 | 0 |
1/f | 100 | 7 | 21 | 1.5 | 0 | 0 |
1/f | 100 | 7 | 101 | 1.5 | 1 | 0 |
1/f | 100 | 10 | 21 | 1.5 | 0 | 0 |
1/f | 100 | 10 | 101 | 1.5 | 1 | 0 |
2.4 MEG experiment
Rationale
In order to demonstrate the applicability of TE to neuroscience data obtained non-invasively we performed MEG recordings in a motor task. Our aim was to show that TE indeed gave the results that were expected based on prior, neuroanatomical knowledge. To verify the correctness of results in experimental data is difficult because no knowledge about the ultimate ground truth exists when data are not simulated. Therefore, we chose an extremely simple experiment—self-paced finger lifting of the index fingers in a self-chosen sequence—where very clear hypotheses about the expected connectivity from the motor cortices to the finger muscles exist.
Subjects and experimental task
Two subjects (S1, m, RH, 38 yrs; S2, f, RH, 23 yrs) participated in the experiment. Subjects gave written informed consent prior to the recording. Subjects had to lift the right and left index finger in a self-chosen randomly alternating sequence with a pause of approximately 2 s between successive finger liftings. Finger movements were detected using a photosensor. In addition, an electromyogram (EMG) response was recorded from the extensor muscles of the right and left index fingers.
Recording and preprocessing
MEG data were recorded using a 275 channel whole head system (OMEGA2005, VSM MedTech Ltd., Coquitlam, BC, Canada) in a synthetic 3rd order gradiometer configuration. Additional electrocardiographic, -oculographic and -myographic recordings were made to measure the electrocardiogram (ECG), horizontal and vertical electrooculography (EOG) traces, and the electromyogram (EMG) for the extensor muscles of the right and left index fingers. Data were hardware filtered between 0.5 and 300 Hz and digitized at a sampling rate of 1.2 kHz. Data were recorded in two continuous sessions lasting 600 s each. For the analysis of effective connectivity between scalp sensors and the EMG, data were preprocessed using the Fieldtrip open-source toolbox for MATLAB (http://fieldtrip.fcdonders.nl/; version 2008-12-10). Data were digitally filtered between 5 and 200 Hz and then cut into trials from 1,000 ms before to 90 ms after the photosensor indicated a lift of the left or right index finger. This latency range ensured that enough EMG activity was included in the analysis. We used the artifact rejection routines implemented in Fieldtrip to discard trials contaminated with eye-blinks, muscular activity and sensor jumps.
Analysis of effective connectivity at the MEG sensor level using transfer entropy
Effective connectivity was analyzed using the algorithm to compute transfer entropy as described above. The algorithm was implemented as a toolbox (Lindner et al. 2009) for Fieldtrip data structures (http://fieldtrip.fcdonders.nl/) in MATLAB. The nearest neighbour search routines were implemented using OpenTSTool (Version 1.2 on Linux 64 bit; Merkwirth et al. 2009). Parameters for the analysis were chosen based on a scan of the parameter space, to obtain maximum sensitivity. In more detail, we computed the difference between the transfer entropy for the MEG data and the surrogate data for all combinations of parameters chosen from: \(\tau=1\ \mathit{act}\), u ∈ [10,16,22,30,150], d ∈ [4,5,7], k ∈ [4,5,6,7,8,9,10]. We performed the statistical test for a significant deviation from independence for each of these parameter sets. This way a multiple testing problem arose, in addition to the multiple testing based on the multiple directed interactions between the chosen sensors (see next paragraph). We therefore performed a correction for multiple comparisons using the false discovery rate (FDR, q < 0.05, Genovese et al. 2002). The parameter values with optimum sensitivity, i.e. most significant results across sensor pairs after correction for multiple comparisons, were: embedding dimension d = 7, embedding delay \(\tau = 1\ \mathit{act}\), forward prediction time u = 16 ms, number of neighbors considered for density estimations k = 4, time window for exclusion of temporally correlated neighbors \(T = 1\ \mathit{act}\). In addition we required that prediction should be possible for at least 150 samples, i.e. individual trials where the combination of a long autocorrelation time and the embedding dimension of 7 did not leave enough data for prediction were discarded. We required that at least 30 trials should survive this exclusion step for a dataset to be analyzed.
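The parameter grid listed above contains 5 × 3 × 7 = 105 parameter sets, each of which is tested for every sensor pair, so the number of p-values entering the correction grows quickly. A minimal Python sketch of an FDR threshold is given below; it assumes the standard Benjamini-Hochberg step-up rule, and the function name is hypothetical.

```python
def fdr_threshold(p_values, q=0.05):
    """Benjamini-Hochberg step-up threshold.

    Returns the largest sorted p-value p(i) with p(i) <= (i / m) * q,
    or 0.0 if no p-value satisfies the condition. Tests with a p-value
    at or below the returned threshold are declared significant.
    """
    m = len(p_values)
    threshold = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * q:
            threshold = p
    return threshold
```

Compared with a Bonferroni correction, this rule adapts the threshold to the observed distribution of p-values and is therefore less conservative when many tests show an effect.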
A slow rhythm (6–10 Hz) has been postulated to provide a common timing for agonist/antagonist muscle pairs in slow movements and is thought to arise from synchronization in a cerebello-thalamo-cortical loop (Gross et al. 2002). The coupling of cortical (primary motor cortex M1, primary somatosensory cortex S1) activity to muscular activity was proposed to be bidirectional (Gross et al. 2002) in this frequency range. The coupling may also depend on oscillations in spinal stretch reflex loops (Erimaki and Christakos 2008).
Activity in the beta range (~20 Hz) has been suggested to subserve the maintenance of current limb position (Pogosyan et al. 2009) and strong cortico-muscular coherence in this band has been found in isometric contraction accordingly (Schoffelen et al. 2008). Coherent activity in the beta band has also been demonstrated between bilateral motor cortices (Mima et al. 2000; Murthy and Fetz 1996).
In contrast, motor-act related activity in the gamma band (>30 Hz) is reported less frequently and its relation to motor control is less clearly understood to date (Donoghue et al. 1998). We therefore focused our analysis on a frequency interval from 5–29 Hz.
3 Results
3.1 Overview
In this section we first present the analysis of effective connectivity in pairs of simulated signals {X,Y}. All signal pairs were unidirectionally coupled from X to Y. We used three coupling functions: linear, threshold and a purely non-linear quadratic coupling. We simulated two different signal dynamics, AR(10) processes and processes with 1/f spectra, which were close to spectra observed in biological signals. The two signals of a pair always had similar characteristics. We always analyzed both directions of potential effective connectivity, X → Y and Y → X, to quantify both the sensitivity and the specificity of our method.
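To make the three coupling types concrete, the following Python sketch generates a unidirectionally coupled pair. It is a simplified illustration and not the simulation used in the study: it uses AR(1) instead of AR(10) dynamics, and the coefficient `a`, the coupling strength `gamma`, the threshold `theta` and the step-function form of the threshold coupling are all hypothetical choices.

```python
import numpy as np


def coupling(x, kind, theta=0.0):
    """Hypothetical coupling functions applied to the source signal."""
    if kind == "linear":
        return x
    if kind == "threshold":
        return (x > theta).astype(float)   # step function at theta
    if kind == "quadratic":
        return x ** 2                      # purely non-linear coupling
    raise ValueError(f"unknown coupling: {kind}")


def ar1(drive, a):
    """First-order autoregressive filter: out[t] = a * out[t-1] + drive[t]."""
    out = np.zeros(len(drive))
    for t in range(1, len(drive)):
        out[t] = a * out[t - 1] + drive[t]
    return out


def simulate_pair(n, delay, kind, gamma=0.5, a=0.7, seed=0):
    """Simulate X -> Y with interaction delay `delay` (in samples, >= 1)."""
    rng = np.random.default_rng(seed)
    x = ar1(rng.standard_normal(n), a)
    input_from_x = np.zeros(n)
    input_from_x[delay:] = coupling(x[:-delay], kind)
    y = ar1(gamma * input_from_x + rng.standard_normal(n), a)
    return x, y
```

For the quadratic case the delayed linear cross-correlation between X and Y is close to zero although Y depends strongly on X, which is the situation in which a linear measure is expected to fail.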
In addition to this basic simulation we investigated the following special cases: coupling via multiple coupling delays for linear and threshold interactions, linearly mixed observation of two coupled signals for linear and threshold coupling, and observation of a single signal via two sensors with different noise levels. In this last case no effective connectivity should be detected. The absence of false positives in this latter case is of particular importance for EEG and MEG sensor-level analysis.
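The two observation scenarios just described can be written down in a few lines of Python. This is only a sketch under assumptions: the symmetric mixing with a single parameter `eps` and the additive Gaussian sensor noise are hypothetical forms chosen for illustration, and the function names are not from the source.

```python
import numpy as np


def mix_linear(x, y, eps):
    """Instantaneous linear mixing of two signals onto two sensors.

    eps = 0 leaves the signals unmixed; eps = 0.5 makes both sensors identical.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return (1.0 - eps) * x + eps * y, eps * x + (1.0 - eps) * y


def two_noisy_views(source, sd_a, sd_b, seed=0):
    """Observe one source signal via two sensors with different noise levels."""
    rng = np.random.default_rng(seed)
    source = np.asarray(source, dtype=float)
    sensor_a = source + sd_a * rng.standard_normal(source.size)
    sensor_b = source + sd_b * rng.standard_normal(source.size)
    return sensor_a, sensor_b
```

In the second scenario the two sensors are strongly correlated at zero lag, yet there is no interaction between them, so any reported effective connectivity is a false positive.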
As a proof of principle we then applied the analysis of effective connectivity via TE to MEG signals recorded in a self-paced finger lifting task. Here the aim was to recover the known connectivity from contralateral motor cortices to the muscles of the moved limb, via a comparison of effective connectivity for left and right finger lifting.
3.2 Simulation study
Detection of non-linear interactions for various signal dynamics
Detection of interactions with multiple interaction delays
Detection of effective connectivity from linearly mixed measurement signals
Robustness against instantaneous mixing
Choosing embedding parameters for delayed interactions
To demonstrate the importance of correct embedding we simulated unidirectionally coupled signals with various interaction delays and analyzed effective connectivity with different choices for the embedding dimension d, the embedding delay τ and the prediction time u (Tables 1 and 2). As expected from theoretical considerations (see Fig. 1), false positive effective connectivity is reported for short interaction delays (5, 20 samples) in combination with short prediction times (six samples) and insufficient embedding (d = 4, \(\tau=1 \ \mathit{act}\)). In contrast, if we try to detect long interaction delays (δ = 20, 100) with too short prediction times (u = 6), again with insufficient embedding, the method loses its sensitivity, as expected. This indicates that for given analysis parameters (d,τ,u) the range of interaction delays δ that can be investigated reliably is limited (Table 1). The above problem is solved naturally by increasing embedding dimensions and embedding delays, as demonstrated in Table 2, although this may sometimes not be possible in practice. In our simulations we generally found an embedding delay of \(\tau=1.5 \ \mathit{act}\) in combination with embedding dimensions between 7 and 10 to be more appropriate than smaller (d = 4, also see Table 2) or larger embedding dimensions (d = 13, 16, 19, data not shown) or a shorter embedding delay (\(\tau = 1\ \mathit{act}\)). While it is often proposed to use \(\tau = 1\ \mathit{act}\) for embedding, our data suggest that for the evaluation of TE it is particularly important to cover most or all of the memory inherent in both source and target signals. For our data this could be achieved by choosing \(\tau \ > \ 1.5 \ \mathit{act}\) to protect against false positive detection of causality in the presence of delayed interactions.
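The role of d and τ can be made explicit with a short Python sketch of delay embedding. It assumes that act denotes the autocorrelation decay time and takes it as the first lag at which the autocorrelation falls below 1/e; the criterion used in the study is defined elsewhere in the text and may differ. Both function names are hypothetical.

```python
import numpy as np


def autocorrelation_decay_time(x, max_lag=1000):
    """First lag (in samples) at which the autocorrelation drops below 1/e.

    Returns max_lag if the autocorrelation does not decay within max_lag.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    norm = np.dot(x, x)
    max_lag = min(max_lag, len(x) - 1)
    for lag in range(1, max_lag + 1):
        if np.dot(x[:-lag], x[lag:]) / norm < 1.0 / np.e:
            return lag
    return max_lag


def delay_embed(x, d, tau):
    """Delay embedding with dimension d and delay tau (in samples).

    Row i holds (x[t], x[t - tau], ..., x[t - (d - 1) * tau]) for
    t = (d - 1) * tau + i.
    """
    x = np.asarray(x)
    t_idx = np.arange((d - 1) * tau, len(x))
    return np.stack([x[t_idx - j * tau] for j in range(d)], axis=1)
```

Each embedding vector spans (d − 1)·τ samples of the past, so raising d or τ covers more of the signal's memory at the price of fewer usable samples per trial.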
We also observed that values of the prediction time u close to the actual interaction delay δ made the analysis of TE both more sensitive and more robust against false positives, even for suboptimal choices of d and τ (Tables 1 and 2). Hence, a choice of u close to δ, e.g. based on prior (e.g. anatomical) knowledge, may yield a method that is more robust in the face of unknown and hard to determine values for d and τ.
3.3 Effective connectivity at the MEG sensor level
Motor evoked fields
Movement related effective connectivity
4 Discussion
Transfer entropy as a tool to quantify effective connectivity
In the present study we aimed to demonstrate that TE is a useful addition to existing methods for the quantification of effective connectivity. We argued that existing methods such as GC, which are based on linear stochastic models of the data, may have difficulties detecting purely non-linear interactions, such as inverted-U relationships. Here, we could show that transfer entropy reliably and correctly detected effective connectivity when two signals were coupled by a quadratic, i.e. purely non-linear, function (Fig. 2). Particularly relevant for neural interactions, we have also shown that couplings mediated by threshold or sigmoidal functions are correctly captured by TE.
Furthermore, we extended the original definition of TE to deal with long interaction delays and demonstrated that TE detected effective connectivity correctly when the coupling of two signals was mediated by multiple interactions that spanned a range of latencies (Fig. 3).
Moreover, we considered the problem of volume conduction and showed that TE robustly detected effective connectivity when only linear mixtures of the original coupled signals were available (Figs. 4 and 5), if signals were not too close to being identical. In addition, if the two measurements reflected a common underlying source signal ('common drive') but had different levels of measurement noise added, TE in combination with a test on time-shifted data correctly rejected the hypothesis of effective connectivity between the two measurement signals, in contrast to a naive application of GC (Nolte et al. 2008). Therefore, TE in combination with this test is well applicable to EEG and MEG sensor-level signals, where linear instantaneous mixing is inherent in the measurement method. However, without the additional test on time-shifted data, TE had a non-negligible rate of false positive detections of effective connectivity. The origin of these false positives can be understood as follows. Theoretically, transfer entropy is zero in the absence of causality, i.e. when processes are fully independent, as should be the case for surrogate data. TE is also zero for identical copies of a single signal, as required from a causality measure, when driver and response system cannot be distinguished. Here, we considered the case of volume conduction of a single signal onto two sensors in the presence of additional noise. Hence, the use of surrogate data for a test of the causality hypothesis inevitably leads to the comparison of two (noisy) zeros and to false positives. Because of this difficulty we suggest performing the time-shift test whenever multiple observations of a single source signal are likely to be present in the data, as is the case for EEG and MEG measurements.
Last but not least, we proposed TE as a tool for the exploratory investigation of effective connectivity, because it is a model-free measure based on information theory. Complicated types of coupling such as cross-frequency phase coupling (Palva et al. 2005) should be readily detectable without prior specification. For example, the coupling via a quadratic function, as investigated here, introduces a frequency-doubled (and distorted) input to the target signal; nevertheless, it was readily detected by TE. While the argument on model-freeness holds theoretically, any practical implementation comes with certain parameters that have to be adapted to the data empirically, such as the correct choice of a delay τ and the number of dimensions d used for delay embedding. In addition, the implementation of TE proposed here incorporates a parameter for the prediction time u to adapt the analysis for cases where a long interaction delay is present. If chosen ad hoc, these parameters amount to a sort of model for the data. To keep the method model-free we therefore proposed either to scan a sufficiently large parameter space on pilot data before analyzing the data of interest, or to scan the parameter space and to correct for the arising multiple comparison problem later on, during statistical testing.
To handle the estimation of TE, the parameter scanning and the statistical testing, including the shift-test, we implemented the proposed procedure in the form of a convenient open-source MATLAB toolbox for the Fieldtrip data format that is available from the authors (Lindner et al. 2009).
Limitations
Despite the above-mentioned merits, the TE method also has limitations that have to be considered carefully to avoid misinterpretations of the results:
We note that model-freeness is not always an advantage. In contrast to model-based methods, the detection of effective connectivity via TE does not entail information on the type of interaction. This fact has two important consequences. First, the absence of a specific model of the interaction leads to a high sensitivity for all types of dependencies between two time series. This way, trivial (nuisance) dependencies might be detected by testing against surrogates. This is bound to happen if these dependencies are not kept intact when creating the surrogate data. Second, the specific type of interaction must be separately assessed post hoc by using model-based methods, after the presence of effective connectivity was established using transfer entropy. In principle the analysis of effective connectivity using TE, and the post-hoc comparison of signal pairs with and without significant interaction in an exploratory search of the actual mechanism of this interaction, are possible in the same dataset. This is because these two questions are orthogonal. However, the relationship between significant effective connectivity, as detected by TE, and a specific mechanism of the interaction needs to be tested on independent data.
1. The description of all systems involved has to be causally complete, i.e. there must not be unobserved common causes that do not enter the analysis.
2. If two systems are related by a deterministic map, no causality can be inferred. This would exclude systems exhibiting complete synchronization, for example. Technically this is reflected in Eq. 4: For TE to be well defined the probability densities and their logarithms must exist. Therefore δ-distributions in the joint embedding space of two signals, which are equivalent to deterministic maps between these signals, are excluded.
3. The concept of observational causality rests on the axiom that the past and present may cause the future but the future may not cause the past. For this axiom to be useful, observations must be made at a rate that allows a meaningful distinction between past, present and future with respect to the interaction delays involved. This means that interactions that take place on a timescale faster than the sampling rate must be missed in methods based on observational causality.
Application of TE to MEG recordings in a motor task
As a proof-of-principle, we applied TE to MEG data recorded during self-paced finger lifting. The analysis of the effective connectivity from MEG to EMG signals was performed without the recommended rectification of the EMG signal (Myers et al. 2003) to prove that TE could perform the analysis well without this step. Our expectations of stronger effective connectivity from contralateral motor cortex to the moved finger were met for both fingers in both investigated subjects. Surprisingly, however, we also found stronger effective connectivity from ipsilateral motor cortex to the moved finger. It is not clear at present whether this effective connectivity reflected an indirect interaction: contralateral motor cortex may drive both ipsilateral cortex and the muscles of the moved finger, albeit with strongly differing delays. In this case, TE may erroneously detect effective connectivity from ipsilateral cortex to the muscle, as discussed above. Additional analyses quantifying the coupling between the two motor cortices will be necessary to clarify this issue. As discussed below, these analyses should preferentially be performed using a multivariate extension of the TE method.
Comparison to existing literature
The application of non-linear methods to detect effective connectivity in neuroscience data has been suggested before: One of the earliest attempts to extend GC to the non-linear case and to apply it to neurophysiological data was presented by Freiwald et al. (1999). They used a locally linear, non-linear autoregressive (LLNAR) model where time-varying autoregression coefficients were used to capture non-linearities. This model was only tested, however, on simulations of unidirectionally and linearly coupled signals and correctly identified the coupling as unidirectional and as linear. No attempt was made to validate the model on simulations of explicitly non-linear directed interactions. Application to EEG data from a patient with complex partial seizures indicated non-linear coupling of the signals measured at electrode positions C3 and C4. Another test on local field potential (LFP) data recorded in the anterior inferotemporal cortex (area TE) of the macaque monkey, however, detected no indication of a non-linear interaction. We add to these results by demonstrating that purely non-linear (square, threshold) interactions are also reliably detected using TE in combination with appropriate statistical testing, and by demonstrating that interactions can also be found in MEG and EMG data, even when omitting the usual rectification of the EMG. Chávez et al. (2003) used TE on data from an epileptic patient and also proposed a statistical test based on block-resampling of the data that is similar to the trial shuffling approach used here. They found that TE with a fixed prediction time and a fixed inclusion radius for neighbor search was able to detect the directed linear and non-linear interactions for the simulated models. Our findings are in agreement with these results.
In addition, we demonstrated that TE also detects directed non-linear interactions for biologically plausible data with 1/f characteristics and a range of interaction delays instead of a single delay. Hinrichs et al. (2008) used a measure that is very similar to transfer entropy as it was investigated here. However, in contrast to our study they substituted the time-consuming estimation of probability densities by kernel-based methods with a linear method based on the data covariance matrices. As explicitly stated in the mathematical appendix of Hinrichs et al. (2008), this effectively limits the detection of directed interactions to linear ones. Here, we demonstrate that, while being relatively time consuming, a kernel-based estimation of the required probability densities is feasible using the Kraskov-Stögbauer-Grassberger estimator (Kraskov et al. 2004), even for a dimensionality of five and higher. We note, however, that the amount of data necessary for these estimations may not always be available and that the achievable 'temporal resolution' is limited by this factor. Interestingly, scanning of the prediction time u revealed an optimal interaction delay in the MEG/EMG data of around 16 ms, in accordance with their findings.
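For readers who want to see what a nearest-neighbour estimate of this kind involves, the following Python sketch estimates TE as a conditional mutual information with a Kraskov-type k-nearest-neighbour estimator. It is an illustration under assumptions, not the toolbox implementation: it uses SciPy's `cKDTree` in place of OpenTSTool, omits the exclusion window T for temporally correlated neighbours, and the function name `transfer_entropy_knn` is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma


def _embed(x, d, tau, t_idx):
    """Delay vectors (x[t], x[t - tau], ..., x[t - (d - 1) * tau]) for t in t_idx."""
    return np.stack([x[t_idx - j * tau] for j in range(d)], axis=1)


def _count_within(points, radius):
    """Per point: number of points (itself included) within radius, max-norm."""
    tree = cKDTree(points)
    return tree.query_ball_point(points, r=radius, p=np.inf, return_length=True)


def transfer_entropy_knn(x, y, d=1, tau=1, u=1, k=4):
    """Estimate TE from source x to target y (in nats) with prediction time u.

    TE is estimated as I(y[t + u]; x_t^d | y_t^d). Neighbour distances are
    fixed in the joint space and neighbours are counted in the marginal spaces.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    t_idx = np.arange((d - 1) * tau, len(y) - u)
    y_future = y[t_idx + u][:, None]
    y_past = _embed(y, d, tau, t_idx)
    x_past = _embed(x, d, tau, t_idx)

    joint = np.hstack([y_future, y_past, x_past])
    # Column 0 of the query result is the point itself, column k the k-th neighbour.
    dist, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    # Largest float below the k-th neighbour distance, so counts are strict.
    radius = np.nextafter(dist[:, k], 0.0)

    n_y = _count_within(y_past, radius)
    n_yf_y = _count_within(np.hstack([y_future, y_past]), radius)
    n_y_x = _count_within(np.hstack([y_past, x_past]), radius)
    # Counts include the query point, so they equal (number of neighbours + 1).
    te = digamma(k) + np.mean(digamma(n_y) - digamma(n_yf_y) - digamma(n_y_x))
    return float(te)
```

Calling `transfer_entropy_knn(x, y)` and `transfer_entropy_knn(y, x)` on the same pair gives the two directions that are compared in the text. The raw values still have to be tested against surrogate data, because a finite-sample estimate is not exactly zero even for independent signals.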
Outlook
As demonstrated in this study, TE is a useful tool to quantify effective connectivity in neuroscience data. Its ability to detect purely non-linear interactions and to operate without the specification of one or more a priori models makes it particularly useful for exploratory data analysis, but its use is not limited to this application. The implementation of TE estimation used here only considered pairs of signals, i.e. it is a bivariate method. Direct and indirect interactions may, therefore, not be separated well. However, an extension to the multivariate case is possible as noted before (e.g. Chávez et al. 2003) and is currently under investigation. Its application to cellular automata by Lizier and colleagues has already revealed interesting insights into the pattern formation and information flow in these models of complex systems (Lizier et al. 2008).
The problem of direct versus indirect interactions can also be ameliorated for the case of MEG and EEG data by performing the analysis at the level of source time-courses obtained from a suitable source analysis method. Using source level time-courses will reduce the number of signals for analysis. A post hoc analysis of the obtained reduced network of effective connectivity by DCM may be possible then. Using source level time-courses will also improve the interpretability of the obtained effective connectivities compared to those at the sensor level. This is because for a given causal interaction observed at the sensor level any of the multiple sources reflected in the sensor signal may be responsible for the observed effective connectivity.
Although the estimation of TE presented here is geared towards continuous data, TE has found application in the analysis of spiking data as reported in Gourévitch and Eggermont (2007). The particularities of estimating TE from point processes can be found there. Thus, both macroscopic (fMRI, EEG/MEG) and more local signals (LFP, single unit activity) can be readily analyzed in the common framework of TE. In the future, it will be interesting to compare the effective connectivities for a variety of temporal and spatial scales as revealed by TE.
Conclusion
Transfer entropy robustly detected effective connectivity in simulated data both for complex internal signal dynamics (1/f) and for strongly non-linear coupling. Detection of effective connectivity was possible without specifying an a priori model. With the use of an additional test for linear instantaneous mixing it was robust against false positives due to simulated volume conduction. Therefore it is not only applicable to invasive electrophysiological data but also to EEG and MEG sensor-level analysis. Analysis of MEG and EMG sensor-level data recorded in a simple motor task revealed the expected connectivity, even without rectification of the EMG signal. We therefore propose TE as a useful tool for the analysis of effective connectivity in neuroscience data.
Historically, however, GC was formulated without explicit assumptions about the linearity of the system (Granger 1969) and was therefore closely related to Wiener’s formal definition of causality (Wiener 1956).
For a continuous random variable the natural generalization of Shannon entropy is its differential entropy. Although differential entropy does not inherit the properties of Shannon entropy as an information measure, the derived measures of mutual information and transfer entropy retain the properties and meaning they have in the discrete variable case. We refer the reader to Kaiser and Schreiber (2002) for a more detailed discussion of TE for continuous variables. In addition, measurements of physical systems typically come as discrete random variables because of the binning inherent in the digital processing of the data.
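The statement above can be illustrated with a standard textbook argument (it is not taken from the cited references). For a continuous variable X with density p, the differential entropy is

\[ h(X) = -\int p(x)\,\log p(x)\,\mathrm{d}x . \]

Unlike Shannon entropy it can be negative, and it depends on the units of measurement: for a rescaling by a constant \(a \neq 0\),

\[ h(aX) = h(X) + \log|a| . \]

Mutual information is a difference of such terms, \(I(X;Y) = h(X) + h(Y) - h(X,Y)\), and since \(h(aX,bY) = h(X,Y) + \log|a| + \log|b|\) the scale-dependent terms cancel:

\[ I(aX;bY) = I(X;Y), \qquad a, b \neq 0 . \]

The same cancellation applies to TE, which can be written as a conditional mutual information.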
Acknowledgements
The authors would like to thank Viola Priesemann from the Max Planck Institute for Brain Research, Frankfurt, for valuable comments on this manuscript, German Gomez Herrero from the Technical University of Tampere, Wei Wu from the Humboldt-Universität in Berlin, Mikhail Prokopenko from the CSIRO in Sydney, and Prof. Jochen Triesch from the Frankfurt Institute for Advanced Studies (FIAS) for stimulating discussions, and Sarah Straub from the Department of Psychology, University of Regensburg for assistance in data acquisition.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.