Causality in extremes of time series

Consider two stationary time series with heavy-tailed marginal distributions. We aim to detect whether they have a causal relation, that is, whether a change in one causes a change in the other. Usual methods for causal discovery are not well suited if the causal mechanisms only appear during extreme events. We propose a framework to detect a causal structure from the extremes of time series, providing a new tool to extract causal information from extreme events. We introduce the causal tail coefficient for time series, which can identify asymmetrical causal relations between extreme events under certain assumptions. This method can handle nonlinear relations and latent variables. Moreover, we discuss how our method can help estimate a typical time difference between extreme events. Our methodology is especially well suited for large sample sizes, and we demonstrate its performance in simulations. Finally, we apply our method to real-world space-weather and hydro-meteorological datasets.


1 Introduction
The ultimate goal of causal inference is understanding relationships between random variables and predicting the outcomes resulting from their modification or manipulation (Peters et al, 2017). Causal inference finds utility across a wide array of scientific domains. For instance, in the field of medicine, it aids in comprehending the propagation of epileptic seizures across distinct brain regions (Imbens and Rubin, 2015). In climate science, it facilitates the prediction of variables such as temperature and rainfall while unraveling the causes behind sudden changes in river discharges (Mhalla et al, 2020). Extensive efforts have been invested in establishing its mathematical foundation (Pearl, 2009). Structural causal models (SCMs) serve as mathematical tools for representing causal relationships between variables, especially in scenarios where temporal order is unavailable. Recent advancements include estimating causal mechanisms within structural causal models (Peters et al, 2014; Mooij et al, 2016; Glymour et al, 2019). However, causal discovery in time series often necessitates alternative models (Peters et al, 2017, Chapter 10) and faces different problems (Runge et al, 2019a).
A commonly used concept for describing causality in time series is Granger causality (Granger, 1969, 1980). Though various definitions of causality, such as Sims causality (Sims, 1972), structural causality (White and Lu, 2010), and interventional causality (Eichler and Didelez, 2010), coexist, Granger causality holds a prominent position in the literature (Berzuini et al, 2012, Chapter 22.2). It rests upon two key principles: precedence of cause over effect and the unique information contained in the cause that is otherwise unattainable. In this paper, we primarily consider strong Granger causality, which employs conditional independence (see Section 1.3). Variations include linear Granger causality (Hosoya, 1977) and Granger causality in mean (Ho, 2015). In the autoregressive models considered in this paper, all previously mentioned notions are closely related, and their differences are typically not of practical relevance (Berzuini et al, 2012, Chapter 22.8).
Several approaches exist for learning a causal structure in time series. State-of-the-art methods for causal discovery in time series use consecutive conditional independence testing (Zhang and Zhang, 2008) or fitting vector autoregressive (VAR) models (Eichler, 2012). Runge et al (2019b) introduce the PCMCI method, which combines the state-of-the-art PC method (named after its inventors, Peter Spirtes and Clark Glymour; Spirtes et al, 1993) and MCI (momentary conditional independence; Pompe and Runge, 2011) to identify and adjust for many potential confounders. However, there is still a need for methods that are more robust against a variety of hidden confounders. Gerhardus and Runge (2021) introduce the LPCMCI algorithm, which is adapted to handle latent confounders. Another approach leverages Shannon's information theory, utilizing entropy and mutual information to probe causal aspects of dynamic and complex systems (Hlaváčková-Schindler et al, 2007).
Typically, causal inference methods in time series describe causality in the body of the distribution (causality in the mean). Several articles deal with second-order causality (causality in variance; Gemici and Polat, 2021) using, e.g., GARCH (Generalized Autoregressive Conditional Heteroskedastic) modeling (Hafner and Herwartz, 2008). However, causality in extremes is a new field of research. A different perspective emerges when looking mainly at the tails of the distributions (Coles, 2001). Many causal mechanisms are present only during extreme events, and interventions often carry information that is likely to be causal (Cox and Wermuth, 1996). While the connection between extreme value theory and time series has been thoroughly studied (Yang and Hongzhi, 2005; Mikosch and Wintenberger, 2015; Kulik and Soulier, 2020), the area of causal inference at this juncture remains unexplored.
Recent work intertwines extreme value theory and causality in SCM. Engelke and Hitz (2020) propose graphical models in the context of extremes. Kiriliouk and Naveau (2020) study probabilities of necessary and sufficient causation as defined in the counterfactual theory using the multivariate generalized Pareto distributions. Deuber et al (2021) develop a method for estimating extremal quantiles of treatment effects. Further approaches involve recursive max-linear models on directed acyclic graphs (Gissibl and Klüppelberg, 2018;Klüppelberg and Krali, 2021).
Our paper aims to establish a framework of causality in extremes of time series. This work builds on the work of Gnecco et al (2021), who first introduced the concept of causal tail coefficient in the context of SCM, followed by Pasche et al (2022), who extended this coefficient by incorporating possible covariates into the model. We aim to move the theory of causality in extremes from SCMs into a context of Granger-type causality in time series.
The paper is organized as follows. The subsequent section provides a concise overview of the causal tail coefficient's existing developments, presents a motivating time series example, and introduces preliminary results and notation. Section 2 contains the main results and provides an illustrative model example. Section 3 extends the proposed method and discusses outcomes under assumptions violations. The section also addresses scenarios involving hidden confounders and a choice of the minimal time delay. Section 4 addresses the estimation problem, examining estimator properties and employing simulations on synthetic datasets. In Section 5, we apply the method to two real-world datasets. First, we consider an application concerning space weather and geomagnetic storms, corroborating prior findings that employ conditional mutual information. Lastly, we employ our methodology on a hydrometeorological dataset, investigating six distinct weather and climate phenomena. To maintain brevity, proofs are relocated to the Appendix. Appendix A introduces auxiliary propositions used in the proofs, while the proofs themselves can be found in Appendix B.
1.1 SCM and existing work on the causal tail coefficient

Gnecco et al (2021) established the groundwork for the causal tail coefficient within linear SCMs (Pearl, 2009). A linear SCM over real random variables X_1, ..., X_p is a collection of p assignments

X_j := Σ_{k ∈ pa(j)} β_{j,k} X_k + ε_j,  j = 1, ..., p,   (1)

where pa(j) ⊆ V = {1, ..., p} denotes the parents of j, ε_1, ..., ε_p are jointly independent random variables and β_{j,k} ∈ R \ {0} are the causal weights. We suppose that the associated graph G = (V, E), with the directed edge (i, j) present if and only if i ∈ pa(j), is a directed acyclic graph (DAG) with nodes V and edges E. We say that X_i causes X_j if a directed path exists from i to j in G. Structural causal models describe not only observational distributions but also distributions under interventions (or manipulations) on the variables. By intervening on X_j, we understand a new SCM with p assignments, all of them identical with (1) except the assignment for X_j (Pearl, 2009).

Table 1 Causal relationship between a pair of random variables X_i, X_j following an appropriately restricted linear SCM under different values of the causal tail coefficients. Blank entries and cases when Γ_{i,j} < 0.5 or Γ_{j,i} < 0.5 cannot occur (Gnecco et al, 2021, Theorem 1).

                         Γ_{j,i} = 1        Γ_{j,i} ∈ (0.5, 1)    Γ_{j,i} = 0.5
  Γ_{i,j} = 1                               X_i causes X_j
  Γ_{i,j} ∈ (0.5, 1)     X_j causes X_i     Common cause only
  Γ_{i,j} = 0.5                                                   No causal link
Let X_i, X_j be a pair of random variables from a linear SCM. We assume that X_i, X_j are heavy-tailed with respective distribution functions F_i, F_j. The causal (upper) tail coefficient of X_i on X_j is defined in Gnecco et al (2021) as

Γ_{i,j} := lim_{u→1⁻} E[F_j(X_j) | F_i(X_i) > u],

if the limit exists. This coefficient lies between zero and one and captures the causal influence of X_i on X_j in the upper tail since, intuitively, if X_i has a monotonically increasing causal influence on X_j, we expect Γ_{i,j} to be close to unity. The coefficient is asymmetric, as extremes of X_j need not lead to extremes of X_i, and in that case, Γ_{j,i} will be appreciably smaller than Γ_{i,j}. In Section 1.2, we discuss the intuition behind the choice of this coefficient in more detail (in the context of time series). Under certain assumptions on the tails of ε_i and the linear SCM, the values of Γ_{i,j} and Γ_{j,i} allow us to discover the causal relationship between X_i and X_j. These relations are summarized in Table 1.

Gnecco et al (2021) proposed a consistent non-parametric estimator of Γ_{i,j}. Without loss of generality, consider i = 1 and j = 2. If (x_{1,1}, x_{1,2}), ..., (x_{n,1}, x_{n,2}) are n ∈ N independent replicates of (X_1, X_2), the estimator of Γ_{1,2} is defined as

Γ̂_{1,2} := (1/k) Σ_{i=1}^{n} F̂_2(x_{i,2}) 1(x_{i,1} > x_{(n−k+1),1}),   (2)

where x_{(n−k+1),1} is the k-th largest value of x_{1,1}, ..., x_{n,1}, F̂_2(t) = (1/n) Σ_{i=1}^{n} 1(x_{i,2} ≤ t), and 1(·) is the indicator function.

[Figure 1: A sample realisation of (X, Y)⊤ from Subsection 1.2 (X is the cause and Y is the effect). The delay represents the time delay between the time series, in this case, equal to 5.]

Pasche et al (2022) adapted the estimation process to account for potential confounding factors. The authors modified (2) by replacing F̂_2(x) with a covariate-dependent estimator, where the upper tail of F̂_2(x) is modeled by a Pareto approximation (Coles, 2001). They also implemented a permutation test to formally assess the hypothesis that the causal tail coefficient is equal to 1.
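As a concrete illustration, the estimator of Γ_{1,2} can be sketched in a few lines of Python (the paper's accompanying implementation is in R; the toy SCM below, its coefficient 0.8, and the standard Pareto noise are our illustrative choices, not an example from the paper):

```python
import numpy as np

def causal_tail_coef(x1, x2, k):
    """Estimate Gamma_{1,2}: the empirical cdf of x2 averaged
    over the observations where x1 exceeds its k-th largest value."""
    n = len(x1)
    threshold = np.sort(x1)[n - k]            # k-th largest value of x1
    selected = x2[x1 > threshold]
    return np.mean([np.mean(x2 <= t) for t in selected])

# Toy linear SCM X1 -> X2 with standard Pareto(1, 1) noise (illustrative).
rng = np.random.default_rng(0)
n = 20000
x1 = rng.pareto(1.0, n) + 1.0
x2 = 0.8 * x1 + rng.pareto(1.0, n) + 1.0      # X1 causes X2

k = int(np.sqrt(n))
g12 = causal_tail_coef(x1, x2, k)             # close to 1 (causal direction)
g21 = causal_tail_coef(x2, x1, k)             # in (0.5, 1) (anti-causal direction)
print(round(g12, 2), round(g21, 2))
```

The asymmetry g12 > g21 is exactly the pattern summarized in Table 1.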

1.2 Motivating example and the main idea in the context of time series
The following example illustrates a typical case considered in this paper. Let (X, Y)⊤ = ((X_t, Y_t)⊤, t ∈ Z) be a bivariate strictly stationary¹ time series described by the following recurrent relations, where ε^X_t, ε^Y_t iid∼ Pareto(1, 1)². This scenario is depicted in Figure 1. Here, X causes Y (in the Granger sense, see Definition 3 given later), simply because the knowledge of X improves the prediction of Y. However, the converse is not true.

¹ A stochastic process is strictly stationary if the joint distributions of n consecutive variables are time-invariant (e.g., Section 2.1.3 in Lütkepohl (2005)). We will not work with other stationarity types.

² ε^X_t, ε^Y_t are iid (independent and identically distributed), following a Pareto distribution with both parameters equal to 1. The distribution function of a Pareto(a, b) random variable is of the form F(x) = 1 − (a/x)^b for x ≥ a, and zero otherwise. When a = b = 1, it is often called the standard Pareto distribution.
Consider the data shown in Figure 1. Our goal is to identify any causal relationship between these time series. There is (at least in this realization) an evident asymmetry between the two time series in the extremes. If the cause (X) is extremely large, then the effect (Y) will also be extremely large (see the second and third "jump"). However, if Y is extremely large, then X will not necessarily be extremely large (as evident in the first "jump"). This indicates that an extreme value of X tends to trigger an extreme value of Y, implying a causal link in an intuitive sense. Yet, a crucial factor is the presence of a time delay (or time lag). The extremes need not be concurrent; there is a time lag during which the influence of X on Y becomes apparent. In this specific example, this lag is exactly 5 time units.
To encapsulate this idea mathematically, we introduce the causal tail coefficient for time series

Γ time X→Y (p) := lim_{u→1⁻} E[max{F_Y(Y_0), F_Y(Y_1), ..., F_Y(Y_p)} | F_X(X_0) > u],   (3)

where F_X, F_Y are the marginal distribution functions of X, Y, respectively. This coefficient mathematically expresses natural questions: Does an extreme value in X invariably lead to an extreme value in Y? How large will Y be in the next p steps if X is extremely large (on their respective scales)? In our example, we can consider p = 5. If X_0 is extremely large, then Y_5 will surely also be extremely large (a large cause implies a large effect), but not the other way around (a large effect does not imply a large cause). Hence, intuitively, the following should hold: Γ time X→Y (p) = 1, but Γ time Y→X (p) < 1. The main part of the paper consists of determining the assumptions under which this is true.
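To make the intuition tangible, the following Python sketch simulates a pair of series in the spirit of this example and evaluates a sample analogue of Γ time X→Y (p). The AR coefficient 0.5, the cross-lag weight 0.7, and the lag 5 are our illustrative guesses, not the exact recurrence behind Figure 1:

```python
import numpy as np

def gamma_time(x, y, p, k):
    """Sample analogue of Gamma^time_{X->Y}(p): the average of
    max{F_Y(y_t), ..., F_Y(y_{t+p})} over the k largest values x_t."""
    n = len(x)
    Fy = (np.argsort(np.argsort(y)) + 1) / n      # empirical cdf at each y_t
    threshold = np.sort(x)[n - k]                 # k-th largest x value
    idx = np.where(x[: n - p] > threshold)[0]
    return np.mean([Fy[i : i + p + 1].max() for i in idx])

# X drives Y with a 5-step delay; standard Pareto(1, 1) noise throughout.
rng = np.random.default_rng(1)
n = 30000
eX = rng.pareto(1.0, n) + 1.0
eY = rng.pareto(1.0, n) + 1.0
x = np.zeros(n); y = np.zeros(n)
for t in range(5, n):
    x[t] = 0.5 * x[t - 1] + eX[t]
    y[t] = 0.5 * y[t - 1] + 0.7 * x[t - 5] + eY[t]

k = int(np.sqrt(n))
g_xy = gamma_time(x, y, 5, k)   # causal direction, close to 1
g_yx = gamma_time(y, x, 5, k)   # anti-causal direction, visibly below 1
print(round(g_xy, 2), round(g_yx, 2))
```

The estimate in the causal direction concentrates near 1, while the reverse direction stays bounded away from it, mirroring the asymmetry discussed above.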

1.3 Preliminaries and notation
We use bold capital letters to represent notation for time series and random vectors. A time series, or a random process, consists of a set of random variables Z = (Z t , t ∈ Z) defined on the same probability space. We exclusively work with time series defined over integers.
We will use the standard notion of regular variation (Resnick, 1987; Embrechts et al, 1997). A real random variable X is regularly varying with tail index θ > 0 if its distribution function has the form F_X(x) = 1 − x^{−θ} L(x) for some slowly varying function L, i.e., a function satisfying lim_{x→∞} L(αx)/L(x) = 1 for every α > 0 (Kulik and Soulier, 2020, Section 1.3). This property is denoted by X ∼ RV(θ). Examples of regularly varying distributions include the Pareto, Cauchy or Fréchet distributions, to name a few (Mikosch, 1999). For real functions f and g, we write f(x) ∼ g(x) if lim_{x→∞} f(x)/g(x) = 1. The main principle that we aim to use is the so-called max-sum equivalence, that is, for two random variables X, Y we have P(X + Y > x) ∼ P(X > x) + P(Y > x) ∼ P(max(X, Y) > x) as x → ∞. This is satisfied when X, Y iid∼ RV(θ) (Embrechts et al, 1997, Section 1.3.1). Similar results hold even if we deal with finite or infinite sums of random variables (Resnick, 1987, Section 4.5).
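The max-sum equivalence is easy to verify by Monte Carlo; in the following sketch, the sample size and the threshold x = 200 are arbitrary choices:

```python
import numpy as np

# Monte Carlo check of max-sum equivalence for iid standard Pareto(1, 1)
# variables: P(X + Y > x), P(max(X, Y) > x) and P(X > x) + P(Y > x)
# should all be close for a large threshold x.
rng = np.random.default_rng(2)
n = 2_000_000
X = rng.pareto(1.0, n) + 1.0
Y = rng.pareto(1.0, n) + 1.0

x = 200.0
p_sum = np.mean(X + Y > x)
p_max = np.mean(np.maximum(X, Y) > x)
p_marg = np.mean(X > x) + np.mean(Y > x)
print(p_sum, p_max, p_marg)   # all roughly 2/x = 0.01
```

All three empirical probabilities agree up to a few percent, as the asymptotic equivalence predicts.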
Consider a bivariate process (X, Y)⊤ = ((X_t, Y_t)⊤, t ∈ Z). The concept of strong Granger causality (Berzuini et al, 2012, Chapter 22.2) can be expressed as follows: the process X is said to cause Y (denoted as X → Y) if Y_{t+1} is not independent of the past of X given all relevant variables in the universe up to time t except the past values of X; that is, Y_{t+1} is not independent of X_{past(t)} given C_t without the past values of X, where past(t) = (t, t − 1, t − 2, ...) and C_t represents all relevant variables in the universe up to time t. The philosophical notion of C_t is typically replaced by only a finite set of relevant variables. To illustrate with an example, given a three-dimensional process (X, Y, Z)⊤, C_t typically consists of the past values of X, Y and Z up to time t. We have to note that such an X has to be seen only as a potential cause, since enlarging the information set can lead to a change in the causal structure. In certain models, the definition of causality can be simplified (Palachy, 2019). For instance, Sims (1972) demonstrated that the Granger causality definition is equivalent to certain parameter restrictions within a linear framework. We provide the formal definition in Subsection 2.1 for a specific class of autoregressive models.

2 The causal tail coefficient for time series
The central concept introduced in this paper is the causal tail coefficient for time series Γ time X→Y (p) with extremal delay p ∈ N defined in (3). In scenarios where we exclude instantaneous effects (such as when X_0 directly causes Y_0), we can directly employ the coefficient

lim_{u→1⁻} E[max{F_Y(Y_1), ..., F_Y(Y_p)} | F_X(X_0) > u],

where, as usual, F_X, F_Y denote the marginal distribution functions of X, Y. Furthermore, the coefficient depends on the series only through the ranks F_X(X_t), F_Y(Y_t); hence, for any increasing functions h_1 and h_2: Γ time h_1(X)→h_2(Y) (p) = Γ time X→Y (p).

2.1 Models
In this paper, we work with two models of time series: the VAR(q) process (vector autoregressive process of order q, Section 2.3.1 in Lütkepohl (2005)) and the NAAR(q) model (nonlinear additive autoregressive model of order q). We now introduce the notation.

Definition 1. We say that (X, Y)⊤ = ((X_t, Y_t)⊤, t ∈ Z) follows the bivariate VAR(q) model if it has the following representation:

X_t = α_1 X_{t−1} + ⋯ + α_q X_{t−q} + γ_1 Y_{t−1} + ⋯ + γ_q Y_{t−q} + ε^X_t,
Y_t = δ_1 X_{t−1} + ⋯ + δ_q X_{t−q} + β_1 Y_{t−1} + ⋯ + β_q Y_{t−q} + ε^Y_t,

where α_i, β_i, γ_i, δ_i ∈ R, i = 1, ..., q, are real constants, and (ε^X_t, t ∈ Z), (ε^Y_t, t ∈ Z) are white noises. We say that it satisfies the stability condition if det(I_2 − A_1 z − ⋯ − A_q z^q) ≠ 0 for all complex |z| ≤ 1, where A_i denotes the matrix of the lag-i coefficients.

In this paper, our focus is on exploring whether X causes Y. It is also important to highlight that a situation where X causes Y and vice versa simultaneously is admissible.
Under the stability assumption, we can rewrite these time series using causal representations (Lütkepohl, 2005, Section 2.1.3):

X_t = Σ_{i=0}^∞ a_i ε^X_{t−i} + Σ_{i=0}^∞ c_i ε^Y_{t−i},
Y_t = Σ_{i=0}^∞ d_i ε^X_{t−i} + Σ_{i=0}^∞ b_i ε^Y_{t−i},   (4)

with suitable constants a_i, b_i, c_i, d_i ∈ R. Thus, X causes Y if and only if there exists i: d_i ≠ 0 (Kuersteiner, 2010). Now, we state our model assumptions.

Definition 2 (Heavy-tailed VAR model). Let (X, Y)⊤ follow the stable VAR(q) model as defined above with its causal representation given by (4). We introduce the following assumptions:

1. the noise variables ε^X_t, ε^Y_t are iid and regularly varying, ε^X_t, ε^Y_t ∼ RV(θ);
2. all coefficients α_i, β_i, γ_i, δ_i are non-negative;
3. the coefficients a_i, b_i, c_i, d_i in (4) are summable fast enough for the series in (4) to converge almost surely.

Under these assumptions, we refer to (X, Y)⊤ as following the heavy-tailed VAR(q, θ) model (hVAR(q, θ) for short).
The first assumption is crucial, as it ensures the regular variation of our time series. The second assumption can be relaxed and replaced by the extremal causal condition discussed in Subsection 3.1. The third assumption is aimed at ensuring the a.s. summability of the sums Σ_{i=0}^∞ a_i ε^X_{t−i}. Moreover, this assumption guarantees the stationarity of X and Y. It also establishes a crucial max-sum equivalence relationship, namely P(Σ_{i=0}^∞ a_i ε^X_{t−i} > x) ∼ (Σ_{i=0}^∞ a_i^θ) P(ε^X_1 > x) as x → ∞ (Mikosch and Samorodnitsky, 2000, Lemma A.3).
We consider a nonlinear generalization in order to minimize the assumptions on the body of the time series. First, we introduce the nonlinear counterpart of Definition 1.

Definition 3. We define (X, Y)⊤ = ((X_t, Y_t)⊤, t ∈ Z) to follow the bivariate NAAR(q) model, specified by the equations

X_t = f_1(X_{t−1}) + f_2(Y_{t−q}) + ε^X_t,
Y_t = g_1(Y_{t−1}) + g_2(X_{t−q}) + ε^Y_t.

We state that X (Granger) causes Y if g_2 is a non-constant function on the support of X_{t−q} (a.s.).
It is important to emphasize that our restrictions on the functions f_1, f_2, g_1, g_2 are minimal in the body; we have only imposed conditions on their tails. By imposing the constraint lim_{x→∞} h(x)/x < 1 for h = f_2, g_2, we ensure that the tail of f_2(Y_{t−q}) (resp. g_2(X_{t−q})) is not larger than the tail of Y_{t−q} (resp. X_{t−q}). The assumption lim_{x→∞} h(x)/x < 1 for h = f_1, g_1 is closely related to the stationarity of the time series. In the hNAAR models, our assumption of regular variation for the noise variables directly implies that X_t, Y_t ∼ RV(θ) (Yang and Hongzhi, 2005, Theorem 2.3). Moreover, our framework does not assume the existence of any moments; it accommodates cases like θ < 1, where the expectation E(X_t) does not exist.
Obviously, hNAAR(q, θ) models encompass a broader class of time series models compared to hVAR(q, θ) models. However, they are not nested. In the hNAAR(q, θ) case, we assume that X_t and Y_t are functions of only two previous values (X_t being a function of X_{t−1}, Y_{t−q} and Y_t being a function of Y_{t−1}, X_{t−q}). In the VAR case, X_t, Y_t can depend on more than two previous values. Moreover, the NAAR case has the additional assumption ε^X_t, ε^Y_t ≥ 0. Nevertheless, up to these two differences, the class of hVAR(q, θ) models lies inside the class of hNAAR(q, θ) models.

2.2 Causal direction
The subsequent two theorems constitute the core of this paper, connecting the classical concept of causality with causality in extreme events.

Theorem 1. Let (X, Y)⊤ be a bivariate time series which follows either the hVAR(q, θ) model or the hNAAR(q, θ) model. If X causes Y, then Γ time X→Y (q) = 1.
The proof can be found in Appendix B. Intriguingly, the regular variation condition is not used within the proof. We assume that we know the exact (correct) order q. Nevertheless, for every p ≥ q, we have Γ time X→Y (p) ≥ Γ time X→Y (q) = 1. The choice of an appropriate delay p will be discussed in Subsection 3.3.

Theorem 2. Let (X, Y)⊤ be a bivariate time series which follows either the hVAR(q, θ) model or the hNAAR(q, θ) model. If Y does not cause X, then Γ time Y→X (p) < 1 for all p ∈ N.
The proof can be found in Appendix B. The primary step of the proof stems from Proposition 2, presented in Appendix A. The core idea is that large sums of independent, regularly varying random variables tend to be driven by only a single large value. Consequently, if Y_0 is large, it can be attributed to the largeness of some ε^Y_i, which does not affect X. Note that distinct notation is employed for the time series order (denoted as q ∈ N) and the extremal delay (denoted as p ∈ N). Although these two quantities need not be equal, Theorem 1 prompts our primary focus on scenarios where p ≥ q.

Example 1. Consider the bivariate time series (X, Y)⊤ described by the equations: where ε^X_t, ε^Y_t iid∼ Pareto(1, 1), with tail index θ = 1. A causal representation of this hVAR(1, 1) model is given by: In this case, the order is q = 1, and it is sufficient to take p = 1, as discussed in Subsection 3.3. Let us sketch a computation of this coefficient. From Theorem 1, we know that Γ time X→Y (1) = 1. For the other direction, rewrite

Γ time Y→X (1) = lim_{u→1⁻} E[max{F_X(X_0), F_X(X_1)} | F_Y(Y_0) > u].

First, note the following relations (the first follows from independence and the second from Lemma 5 in Appendix A): Furthermore, we know that P(…) holds for every constant λ ∈ R (using Proposition 2). This implies that, with probability 1/2, X_1 | F_Y(Y_0) > u has the same distribution as the unconditional X_1, and with the complementary probability, it tends to ∞ as u → 1⁻. Combining these results, we obtain Γ time Y→X (1) = (1/2) · E[max{F_X(X_0), F_X(X_1)}] + (1/2) · 1.

The order q is usually unknown. If we take p = 2, then the value of Γ time Y→X (2) will be slightly larger than 3/4. More precisely, it will be equal to (1/2) · E[max{F_X(X_0), F_X(X_1), F_X(X_2)}] + (1/2) · 1. The true value is around 0.80, as determined through simulations.
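The single-large-value heuristic behind the proof is easy to check numerically; in this sketch, the number of terms, the number of replications, and the 99.9% cut-off are arbitrary choices:

```python
import numpy as np

# "Single big jump" illustration: when a sum of iid regularly varying
# terms is large, a single term typically accounts for almost all of it.
rng = np.random.default_rng(3)
m, d = 100000, 50                        # 100k sums of d = 50 Pareto(1, 1) terms
terms = rng.pareto(1.0, size=(m, d)) + 1.0
S = terms.sum(axis=1)
M = terms.max(axis=1)

big = S > np.quantile(S, 0.999)          # condition on the sum being extreme
share = np.mean(M[big] / S[big])         # fraction contributed by the largest term
print(round(share, 2))                   # close to 1
```

Conditionally on the sum being extreme, the largest single term carries nearly all of its mass, which is precisely why an extreme Y_0 caused by some ε^Y_i leaves X unaffected.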

3 Properties and extensions
In this section, we exclusively focus on heavy-tailed VAR models. While we believe that similar results also apply to heavy-tailed NAAR models, the formulation and proof of such results are beyond the scope of this paper.

3.1 Real-valued coefficients
We discuss the extension to include potentially negative coefficients (as indicated by the second assumption in Definition 2) and non-directly proportional dependencies. Until now, we have assumed that a large positive X causes large positive Y. In other words, we have assumed that all coefficients in the hVAR model are non-negative.
The most straightforward modification arises when a large positive X causes large negative Y (where the gain for one causes a loss for others). In such cases, it suffices to consider the maxima of the pair (X, −Y). While we apply this simple modification in our application, it cannot be universally applied.
The concept behind extending the causal tail coefficient for time series involves utilizing the absolute values |X| and |Y| instead of X and Y. However, to implement this extension, we need to restrict the family of VAR(q) models in a specific manner. Theorems 1 and 2 do not hold if we allow arbitrary coefficients α i , β i , γ i , δ i ∈ R, whether positive or negative, where i = 1, . . . , q.
The following example illustrates a problematic scenario where Theorems 1 and 2 do not hold.
Example 2. Consider a vector (X, Y)⊤ following a VAR(2) model defined as: Its causal representation can be written as

Detecting extreme causal relationships can be challenging in this model. This is because, despite X being a cause of Y, it cannot be assumed that an extreme in X_{t−1} will necessarily result in an extreme in Y_t. Even if X_{t−2} is large, leading to a large X_{t−1}, the same does not hold for Y_t. To address this, we will impose a restriction on our model in the following manner.

Definition 5 (Extremal causal condition). Consider the pair (X, Y)⊤, which follows the stable VAR(q) model, with its causal structure taking the form: Let X cause Y. We say that (X, Y)⊤ satisfies an extremal causal condition if there exists an integer r ≤ q such that, for every i ∈ N ∪ {0}:

Lemma 1. The extremal causal condition holds in the hVAR(q, θ) models (i.e., where the coefficients are non-negative) when X causes Y.

The proof can be found in Appendix B. Clearly, the aforementioned condition holds in models where d_i ≠ 0 for all i > 0.
The extremal causal condition embodies the following concept: take k ≥ r. If ε^X_t is "extreme", this extreme event will also influence Y_{t+k}. This holds, in particular, even if some of the coefficients are negative: if ε^X_t is "extremely positive", it may be that Y_{t+k} is "extremely negative". Nevertheless, under the extremal causal condition, if ε^X_t is "extreme", Y_{t+k} will also be "extreme".
Formulating similar conditions for NAAR models is challenging, since we would require that an appropriate combination of functions is non-zero. Due to the extremal causal condition, our focus in this section remains exclusively on VAR models, excluding NAAR models. The following theorem demonstrates that, even when negative coefficients are introduced, the principles outlined in Section 2 remain applicable, thanks to the extremal causal condition.

Theorem 3. Let (X, Y)⊤ be a time series which follows the hVAR(q, θ) model, with possibly negative coefficients, satisfying the extremal causal condition. Moreover, let ε^X_t, ε^Y_t have full support on R, and |ε^X_t|, |ε^Y_t| ∼ RV(θ). If X causes Y, and Y does not cause X, then Γ time |X|→|Y| (q) = 1, and Γ time |Y|→|X| (q) < 1.
The proof can be found in Appendix B.
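A small numerical illustration of the idea behind Theorem 3 follows; all coefficients below, including the negative cross-term −0.8, as well as the standard Cauchy noise, are our illustrative choices:

```python
import numpy as np

def gamma_time(x, y, p, k):
    """Empirical causal tail coefficient for time series."""
    n = len(x)
    Fy = (np.argsort(np.argsort(y)) + 1) / n
    threshold = np.sort(x)[n - k]
    idx = np.where(x[: n - p] > threshold)[0]
    return np.mean([Fy[i : i + p + 1].max() for i in idx])

# VAR(1)-type series with a *negative* cross-coefficient and full-support
# Cauchy noise: a large |X| produces a large |Y| (of the opposite sign),
# so the coefficient is applied to the absolute values.
rng = np.random.default_rng(4)
n = 30000
eX = rng.standard_cauchy(n)
eY = rng.standard_cauchy(n)
x = np.zeros(n); y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + eX[t]
    y[t] = 0.5 * y[t - 1] - 0.8 * x[t - 1] + eY[t]   # note the negative sign

k = int(np.sqrt(n))
g_xy = gamma_time(np.abs(x), np.abs(y), 1, k)   # |X| -> |Y|, close to 1
g_yx = gamma_time(np.abs(y), np.abs(x), 1, k)   # |Y| -> |X|, below 1
print(round(g_xy, 2), round(g_yx, 2))
```

On the raw series, a positive extreme of X triggers a negative extreme of Y; taking absolute values restores the asymmetry between the two directions.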

3.2 Common cause and different tail behavior
Reichenbach's common cause principle (Reichenbach, 1956) states that for every given pair of random variables (X, Y), precisely one of the following statements holds: X causes Y, Y causes X, they are independent, or there exists Z causing both X and Y. The challenge lies in distinguishing between true causality and dependence stemming from a common cause. The subsequent theorem demonstrates that our methodology allows us to distinguish between causality and a correlation due to a common cause.
Theorem 4. Let Z be a common cause of both X and Y, and let neither X nor Y cause Z. If Y does not cause X, then Γ time Y→X (p) < 1 for all p ∈ N.

The proof can be found in Appendix B. Theorem 4 initially assumes that the tail indices of X, Y, and Z are equal. However, it can be easily extended to the case when Z has lighter tails (see the discussion in Section 4.5). In practical scenarios, complete observation of all pertinent data is often unattainable. Nevertheless, Theorem 4 remains valid even when we do not observe the common cause. However, the common cause still needs to fulfill the condition that its noise is regularly varying with tails not heavier than those of X and Y. We cannot check this assumption in practice.

Example 3. Let (X, Y, Z)⊤ follow the three-dimensional VAR(1) model, specified by:

Even within the context of the unconfounded scenario (or when ε^Z_t possesses tails that are lighter than those of ε^X_t and ε^Y_t), a distinct challenge emerges when the tail behaviors of ε^X and ε^Y differ. Results from Section 2 remain applicable if ε^X has lighter tails than ε^Y. However, if ε^X has heavier tails than ε^Y, then Γ time X→Y (1) = Γ time Y→X (1) = 1 and we cannot distinguish between the cause and the effect. A more detailed discussion can be found in Subsection 4.5.
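The behavior under a common cause can be illustrated numerically. In the following hypothetical setup (all coefficients and the Pareto noise are our choices), Z drives both X and Y with no direct link between them, and both estimated coefficients stay away from 1:

```python
import numpy as np

def gamma_time(x, y, p, k):
    """Empirical causal tail coefficient for time series."""
    n = len(x)
    Fy = (np.argsort(np.argsort(y)) + 1) / n
    threshold = np.sort(x)[n - k]
    idx = np.where(x[: n - p] > threshold)[0]
    return np.mean([Fy[i : i + p + 1].max() for i in idx])

# Z is a hidden common cause of X and Y; there is no direct X-Y link.
rng = np.random.default_rng(5)
n = 30000
eX = rng.pareto(1.0, n) + 1.0
eY = rng.pareto(1.0, n) + 1.0
eZ = rng.pareto(1.0, n) + 1.0
z = np.zeros(n); x = np.zeros(n); y = np.zeros(n)
for t in range(1, n):
    z[t] = 0.5 * z[t - 1] + eZ[t]
    x[t] = 0.5 * x[t - 1] + 0.6 * z[t - 1] + eX[t]
    y[t] = 0.5 * y[t - 1] + 0.6 * z[t - 1] + eY[t]

k = int(np.sqrt(n))
g_xy = gamma_time(x, y, 1, k)
g_yx = gamma_time(y, x, 1, k)
print(round(g_xy, 2), round(g_yx, 2))   # both visibly below 1
```

Extremes of X that originate in ε^X never reach Y (and vice versa), so neither direction attains the value 1, in contrast to a genuine causal link.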

3.3 Estimating the extremal delay p
Up to this point, we have assumed prior knowledge of the exact order q of our time series, with p being set equal to q. However, what if this order is unknown? In cases where p is too small, accurate causal relationships cannot be obtained (refer to Lemma 2 below). Conversely, as p becomes larger, Γ time Y→X (p) approaches 1, posing challenges for empirical inference.
One practical approach to address this challenge is to utilize the extremogram (Davis and Mikosch, 2009). Similar to the role of correlograms in classical cases, extremograms help us select a "reasonable" value for p by examining plots. Alternatively, we can examine the values of Γ time Y→X (p) for a range of values of p. An illustrative example can be found in Subsection 4.4.

Let us now delve into another consideration: the problem of estimating the temporal synchronization between two time series (X, Y)⊤, where X is the cause of Y. The objective is to determine the time required for information originating from X to impact Y. In the context of an intervention applied to X, the question is: when will this intervention influence Y? A practical instance of this scenario might arise in economics, such as when dealing with two time series representing milk and cheese prices over time. Consider a situation where the government imposes a large tax increase on milk prices at a specific moment. In such a case, when can we expect to observe a subsequent rise in cheese prices? Mathematically, we aim to estimate a parameter known as the minimal delay.

Definition 6 (Minimal delay). Let (X, Y)⊤ follow the stable VAR(q) model specified in Definition 1. We call s ∈ N the minimal delay if γ_1 = ⋯ = γ_{s−1} = δ_1 = ⋯ = δ_{s−1} = 0 and either δ_s ≠ 0 or γ_s ≠ 0. If no such s exists, we define the minimal delay as +∞.
The subsequent lemma establishes connections between the minimal delay and the causal tail coefficient for time series.

Lemma 2. Let (X, Y)⊤ follow the hVAR(q, θ) model, where X causes Y. Let s be the minimal delay. Then, Γ time X→Y (r) < 1 for all r < s, and Γ time X→Y (r) = 1 for all r ≥ s.
The proof can be found in Appendix B. An approach to estimating the minimal delay s involves identifying the smallest value of s for which Γ time X→Y (s) = 1.
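This procedure, scanning p and taking the smallest value at which the estimated coefficient levels off near 1, can be sketched as follows; the model, its minimal delay s = 3, and the cut-off 0.9 are illustrative choices:

```python
import numpy as np

def gamma_time(x, y, p, k):
    """Empirical causal tail coefficient for time series."""
    n = len(x)
    Fy = (np.argsort(np.argsort(y)) + 1) / n
    threshold = np.sort(x)[n - k]
    idx = np.where(x[: n - p] > threshold)[0]
    return np.mean([Fy[i : i + p + 1].max() for i in idx])

# Series with a known minimal delay s = 3: X is iid Pareto(1, 1) noise
# and feeds into Y only at lag 3 (an illustrative model).
rng = np.random.default_rng(6)
n = 40000
x = rng.pareto(1.0, n) + 1.0
eY = rng.pareto(1.0, n) + 1.0
y = np.zeros(n)
for t in range(3, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 3] + eY[t]

k = int(np.sqrt(n))
gammas = [gamma_time(x, y, p, k) for p in range(6)]
s_hat = next(p for p, g in enumerate(gammas) if g > 0.9)  # first p with gamma near 1
print([round(g, 2) for g in gammas], s_hat)
```

The estimated coefficients stay well below 1 for p < 3 and jump close to 1 at p = 3, recovering the minimal delay, in line with Lemma 2.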
Estimating the minimal delay s through extreme values holds relevance across various applications. One such application involves the assessment of synchrony between two time series. Synchrony, as described in Cheong (2020), quantifies the interdependence of two time series, seeking an optimal lag that aligns the time series most closely. Conventional methodologies encompass techniques like time-lagged cross-correlation or dynamic time warping. The causal tail coefficient for time series offers an avenue to proceed using extreme values, facilitating a notion of synchronization of extremes. This may pave the way for future research and methodological developments.

4 Estimations and simulations
All the methodologies presented in this section have been implemented in the R programming language (R Core Team, 2022). The corresponding code can be accessed either in the supplementary package or through the online repository available at https://github.com/jurobodik/Causality-in-extremes-of-time-series.git.

4.1 Non-parametric estimator
We propose an estimator of the causal tail coefficient Γ time X→Y (p) based on a finite sample (x_1, y_1)⊤, ..., (x_n, y_n)⊤. The estimator uses only the values of y_i where the corresponding x_i surpasses a given threshold.

Definition 7. We define

Γ̂ time X→Y (p) := (1/k) Σ_{i=1}^{n−p} max{F̂_Y(y_i), ..., F̂_Y(y_{i+p})} 1(x_i > x_{(n−k+1)}),

where x_{(n−k+1)} is the k-th largest value of x_1, ..., x_n and F̂_Y(t) = (1/n) Σ_{i=1}^{n} 1(y_i ≤ t).

The value k signifies the number of extremes considered in the estimator. In the upcoming discussions, k will be contingent upon n, leading us to employ the notation k_n instead of k. A fundamental condition in extreme value theory is expressed as

k_n → ∞ and k_n/n → 0 as n → ∞.   (5)

The next theorem shows that such a statistic is "reasonable" by showing that it is asymptotically unbiased. This result does not necessitate any stringent model assumptions; it relies only on the assumption of a certain rate of convergence for the empirical distribution function.

Theorem 5. Let (X, Y)⊤ = ((X_t, Y_t)⊤, t ∈ Z) be a stationary bivariate time series whose marginal distributions are absolutely continuous with support on some neighborhood of infinity. Let Γ time X→Y (p) exist, and let k_n satisfy (5) and the rate condition (6) on the empirical distribution function. Then E[Γ̂ time X→Y (p)] → Γ time X→Y (p) as n → ∞.

Consequence 1. Under the assumptions of Theorem 5 and Theorem 1, the proposed estimator is consistent; that is, Γ̂ time X→Y (p) converges in probability to Γ time X→Y (p) = 1 as n → ∞.

The proofs of Theorem 5 and Consequence 1 can be found in Appendix B. To establish asymptotic properties of Γ̂ time X→Y (p)⁵, it is reasonable to assume the consistency of F̂_X. However, this assumption is accompanied by an additional requirement on the rate of its convergence, as specified in condition (6). While this condition is not overly restrictive, we believe that condition (6) can be improved. For iid random variables, condition (6) holds if k_n²/n → ∞ (this follows from the Dvoretzky-Kiefer-Wolfowitz inequality). The following lemma establishes that condition (6) is also fulfilled within a certain subset of autoregressive models.

Lemma 3. Let X = (X_t, t ∈ Z) have the form: then condition (6) is satisfied.

⁵ Γ̂ time X→Y (p) depends on n, although we omit this index for clarity.
The proof can be found in Appendix B. It is based on Proposition 13 in Chen and Wu (2018). Several concentration bounds exist that describe the convergence speed of the empirical distribution functions, and different assumptions can be made to guarantee the validity of the condition (6). According to Theorem 1.2 in Kontorovich and Weiss (2014), condition (6) holds for geometrically ergodic Markov chains. Regrettably, this result was originally presented solely for N-valued random variables. The Dvoretzky-Kiefer-Wolfowitz inequality provides a similar outcome, but we did not come across a specific adaptation of this inequality for autoregressive time series.
Lemma 3 requires a finite second moment of ε_i, a technical condition that follows from the findings in Chen and Wu (2018). For heavy-tailed time series characterized by a tail index θ, this condition necessitates θ > 2. Notably, the coefficient β in Lemma 3 encapsulates the strength of dependence in the time series: larger values of β correspond to faster decay of dependence, thereby allowing milder assumptions on the intermediate sequence (k_n). Throughout the paper, we opt for the choice k_n = √n, a decision briefly explored in Subsection 4.3.

Some insight using simulations
We simulate the behavior of the estimator Γ̂^time_{X→Y} for a series of models. Initially, we employ the Monte Carlo principle to estimate the distributions of Γ̂^time_{X→Y} and Γ̂^time_{Y→X} for the following model.

Model 1. Let (X, Y)^⊤ follow a bivariate VAR model in which X_{t−1} enters the equation for Y_t with coefficient δ ∈ R, where ε^X_t and ε^Y_t are independent noise variables.

Figure 2 presents the histograms of Γ̂^time_{X→Y}(2) and Γ̂^time_{Y→X}(2) from 1000 simulated instances of the time series, each of length n = 5000, following Model 1. The parameter δ is set to 0.5, and ε^X_t and ε^Y_t are drawn from the Cauchy distribution. Figure 2 illustrates that in the causal direction, the values of Γ̂^time_{X→Y} predominantly fall within the range (0.9, 1); conversely, in the non-causal direction, the values of Γ̂^time_{Y→X} typically lie below 0.9. This underscores the asymmetry between the cause and the effect.
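The Monte Carlo experiment can be reproduced along the following lines. Since Model 1's displayed equations appear in the paper, the recursion below is a hypothetical lag-1 stand-in with the same structure (autoregressive coefficients 0.5, X entering Y with coefficient δ, Cauchy noise); the estimator is the one from Definition 7.

```python
import numpy as np

def gamma_time(x, y, p, k=None):
    # Estimator from Definition 7: average, over the k largest x-values, of the
    # maximum empirical-CDF value of y within the window [i, i + p].
    n = len(x)
    k = k or int(np.sqrt(n))
    f_y = (np.argsort(np.argsort(y)) + 1) / n      # empirical CDF at each y_i
    idx = np.argsort(x)[-k:]                        # indices of the k largest x
    return float(np.mean([f_y[i:min(i + p + 1, n)].max() for i in idx]))

def simulate_model1(n, delta, rng, burn=100):
    # Hypothetical stand-in for Model 1: lag-1 VAR with Cauchy noise.
    x = np.zeros(n + burn)
    y = np.zeros(n + burn)
    ex = rng.standard_cauchy(n + burn)
    ey = rng.standard_cauchy(n + burn)
    for t in range(1, n + burn):
        x[t] = 0.5 * x[t - 1] + ex[t]
        y[t] = 0.5 * y[t - 1] + delta * x[t - 1] + ey[t]
    return x[burn:], y[burn:]

rng = np.random.default_rng(1)
xy_vals, yx_vals = [], []
for _ in range(20):                                 # 20 Monte Carlo repetitions
    x, y = simulate_model1(5000, 0.5, rng)
    xy_vals.append(gamma_time(x, y, p=2))
    yx_vals.append(gamma_time(y, x, p=2))
g_xy, g_yx = float(np.mean(xy_vals)), float(np.mean(yx_vals))
```

On such samples, the causal-direction average typically lands near 1, while the reverse direction stays noticeably lower, reproducing the asymmetry seen in Figure 2.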
We proceed by simulating (X, Y)^⊤ according to Model 1. We explore three different values for the parameter δ (0.1, 0.5, and 0.9) and three distinct sample sizes (n = 100, 1000, and 10000). The random variables ε^X_t and ε^Y_t are generated from either the standard normal distribution (thus violating the RV assumption) or the standard Pareto distribution. For each value of δ, noise distribution, and sample size, we calculate the estimator Γ̂^time_{X→Y} := Γ̂^time_{X→Y}(2). This procedure is repeated 200 times, and we compute the means and quantiles of these estimators. The summarized outcomes are presented in Table 2, where each cell corresponds to a specific model defined by δ, the noise distribution, and the sample size.

Table 2 The results obtained from 200 simulated time series following Model 1, where X causes Y. Each cell corresponds to a distinct coefficient δ, a specific number of data points n (100, 1000, 10000), and a particular noise distribution (standard Pareto or standard Gaussian errors). The value Γ̂^time_{X→Y} = · ± · represents the mean of all 200 estimated coefficients Γ̂^time_{X→Y}, along with the difference between the 95% empirical quantile and the mean, based on all 200 simulations.

Conclusion. The method's performance is unexpectedly robust even when the assumption of regular variation is violated. Although we lack a theoretical justification for this observation, particularly with Gaussian noise, the estimates at finite sample sizes still exhibit an inherent asymmetry between the causal and non-causal directions. Further investigation is required to determine whether this observation holds as a general result.
Conversely, when the tails of the noise variables differ between the cause and the effect, our method does not perform well. If ε^X_t possesses heavier tails than ε^Y_t, then for large n both estimates tend to converge close to 1. Alternatively, if ε^X_t has lighter tails than ε^Y_t, both Γ̂^time_{X→Y} and Γ̂^time_{Y→X} tend to be very small, deviating significantly from 1. Consequently, differing tail behavior of the noise variables appears to be the primary challenge; this issue is explored in greater detail in Subsection 4.5.

Choice of a threshold
A common challenge in extreme value theory is the selection of an appropriate threshold. In our case, this corresponds to the choice of the parameter k in Definition 7. There exists a trade-off between bias and variance; smaller values of k lead to reduced bias (but increased variance), and larger values of k result in higher bias (but lower variance). Unfortunately, there is no universally applicable method for threshold selection.
To provide insight into this behavior, consider Model 1 with n = 1000 and δ = 0.5. Figure 3 illustrates the estimators Γ̂^time_{X→Y}(2) and Γ̂^time_{Y→X}(2) for various values of k. The variance of Γ̂^time_{Y→X}(2) is very large for small k; conversely, as k increases, the bias of Γ̂^time_{X→Y}(2) increases. Based on this example and several others, the choice k = √n appears to be a reasonable option. It is crucial to emphasize that while this choice is pragmatic, it is not necessarily optimal.
By visually evaluating Figure 3, it is possible to infer the values of Γ̂^time_{X→Y}(2) and Γ̂^time_{Y→X}(2): Γ̂^time_{X→Y}(2) approaches one as k decreases, while the same trend does not hold for Γ̂^time_{Y→X}(2).
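The bias-variance trade-off in k can be examined numerically by sweeping k on simulated data. The model below is the same hypothetical lag-1 stand-in for Model 1 used earlier (Cauchy noise), not the paper's exact specification.

```python
import numpy as np

def gamma_time(x, y, p, k):
    n = len(x)
    f_y = (np.argsort(np.argsort(y)) + 1) / n
    idx = np.argsort(x)[-k:]
    return float(np.mean([f_y[i:min(i + p + 1, n)].max() for i in idx]))

def simulate_model1(n, delta, rng, burn=100):
    # hypothetical lag-1 stand-in for Model 1, Cauchy noise
    x = np.zeros(n + burn); y = np.zeros(n + burn)
    ex = rng.standard_cauchy(n + burn); ey = rng.standard_cauchy(n + burn)
    for t in range(1, n + burn):
        x[t] = 0.5 * x[t - 1] + ex[t]
        y[t] = 0.5 * y[t - 1] + delta * x[t - 1] + ey[t]
    return x[burn:], y[burn:]

rng = np.random.default_rng(2)
n, p, ks = 1000, 2, [5, 31, 500]       # 31 is roughly sqrt(1000)
xy = {k: [] for k in ks}
yx = {k: [] for k in ks}
for _ in range(40):
    x, y = simulate_model1(n, 0.5, rng)
    for k in ks:
        xy[k].append(gamma_time(x, y, p, k))
        yx[k].append(gamma_time(y, x, p, k))
```

Small k inflates the spread of the reverse-direction estimate, while a very large k drags the causal-direction estimate down, mirroring the pattern in Figure 3.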

Choice of the extremal delay
To provide an example of how Γ^time_{X→Y}(p) behaves for various selections of p, we consider the following model.

Model 2. Let (X, Y)^⊤ follow the hVAR(6, 1) model with ε^X_t, ε^Y_t iid ∼ Cauchy. Notice that the minimal delay is equal to 6.

Similarly to the approach in Subsection 4.3, we simulate data from Model 2 with a size of n = 1000. Subsequently, we compute Γ̂^time_{X→Y}(p) and Γ̂^time_{Y→X}(p) for different values of p. This process is repeated 100 times. Figure 4 shows the mean and the 5% and 95% empirical quantiles of these estimates.
The coefficient Γ̂^time_{X→Y}(p) in the correct causal direction rises much faster than in the other direction until it reaches the "correct" minimal delay. Beyond that point, the coefficient approaches 1 and remains close to it even for larger p, aligning with theoretical expectations. Conversely, Γ̂^time_{Y→X}(p) ascends more slowly, gradually converging towards 1.
Nevertheless, practical scenarios often demand the selection of a specific p. In cases involving a small number of time series, like the application in Subsection 5.1, it is feasible to calculate Γ̂^time_{X→Y}(p) for several choices of p and create visualizations, as exemplified in Figure 8 further ahead. In instances involving a greater number of time series, such as the application in Subsection 5.2, a pragmatic choice of p should align with the maximum expected physical time delay within the system. The challenge of selecting a suitable time delay is not exclusive to the extreme value context and is encountered in non-extreme scenarios as well (Hacker and Hatemi-J, 2008; Runge et al, 2019b).

Fig. 4 The figure visualizes the behavior of the estimators Γ̂^time_{X→Y}(p) (in blue) and Γ̂^time_{Y→X}(p) (in red) for various selections of the extremal delay p. This is done using the VAR(6) model (Model 2) with n = 1000. The thick line represents the mean value across 100 realizations, while the thin lines correspond to the 5% and 95% empirical quantiles of these estimates.
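The behavior around the minimal delay can be reproduced with a stand-in for Model 2: a hypothetical delayed VAR in which X enters Y only at lag 6 (the paper's hVAR(6, 1) display may differ in its coefficients). Below the minimal delay the coefficient stays low; once the window covers the delay, it jumps toward 1.

```python
import numpy as np

def gamma_time(x, y, p, k=None):
    n = len(x)
    k = k or int(np.sqrt(n))
    f_y = (np.argsort(np.argsort(y)) + 1) / n
    idx = np.argsort(x)[-k:]
    return float(np.mean([f_y[i:min(i + p + 1, n)].max() for i in idx]))

def simulate_delayed(n, rng, lag=6, burn=100):
    # Hypothetical stand-in for Model 2: X enters Y only at the given lag.
    x = np.zeros(n + burn); y = np.zeros(n + burn)
    ex = rng.standard_cauchy(n + burn); ey = rng.standard_cauchy(n + burn)
    for t in range(lag, n + burn):
        x[t] = 0.5 * x[t - 1] + ex[t]
        y[t] = 0.5 * y[t - 1] + 0.5 * x[t - lag] + ey[t]
    return x[burn:], y[burn:]

rng = np.random.default_rng(4)
means = {}
for p in (2, 8):                       # below vs above the minimal delay of 6
    vals = []
    for _ in range(30):
        x, y = simulate_delayed(1000, rng)
        vals.append(gamma_time(x, y, p))
    means[p] = float(np.mean(vals))
```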

Hidden confounder and different tail behavior of the noise
The aforementioned examples have exclusively addressed scenarios where X causes Y. However, it is important to investigate how the methodology fares when the correlation stems from a shared confounder, and how it responds when the tail indexes of the noise variables differ. To address these inquiries, we analyze the following model (Model 3).
The process Z represents an unobserved common cause, and the parameters illustrate various scenarios: no causality (δ_X = δ_Y = 0), X causing Y (δ_X ≠ 0), Y causing X (δ_Y ≠ 0), and bidirectional causality (δ_X, δ_Y ≠ 0; notation X ↔ Y). Additionally, the tail indexes θ_X, θ_Y, and θ_Z characterize the tails of the respective time series: if θ_X is large, then the distribution of ε^X_t is close to Gaussian, and θ_X < θ_Y indicates that ε^X_t exhibits heavier tails than ε^Y_t. Simulating time series according to Model 3 with n = 1000, δ_X, δ_Y ∈ {0, 0.5, 1}, and various tail-index combinations, we compute Γ̂^time_{X→Y} := Γ̂^time_{X→Y}(3). This procedure is repeated 500 times to compute means and empirical quantiles of the estimators. The difference Γ̂^time_{X→Y} − Γ̂^time_{Y→X} is also calculated, alongside 95% empirical quantiles. The outcomes are displayed in Table 3.
The results indicate that whenever either Γ̂^time_{X→Y} or Γ̂^time_{Y→X} is smaller than one, we can detect the correct causal direction. When X does not cause Y and vice versa, Γ̂^time_{X→Y} and Γ̂^time_{Y→X} are very similar and smaller than 1. When X causes Y, Γ̂^time_{X→Y} is generally much larger than Γ̂^time_{Y→X}. When X ↔ Y, both Γ̂^time_{X→Y} and Γ̂^time_{Y→X} closely approach 1. However, if ε^Z_t exhibits heavier tails than ε^X_t and ε^Y_t, both Γ̂^time_{X→Y} and Γ̂^time_{Y→X} are close to 1 regardless of the true causal relationship between X and Y. Consequently, if Γ̂^time_{X→Y} and Γ̂^time_{Y→X} are close to 1, we cannot distinguish whether this is due to X ↔ Y or due to the presence of a hidden confounder with heavier tails.
An interesting scenario arises when θ_X ≠ θ_Y. Theory suggests that the method should work well as long as θ_X ≥ θ_Y (heavier tails of ε^Y_t) and should no longer work if θ_X < θ_Y (a supporting argument for this claim can be found in Appendix B, Claim 1). However, these results pertain to a population (limiting) context. For small sample sizes, the situation where θ_X < θ_Y appears to "work well", as evident from the blue values in Tables 3 and 4: there is an evident asymmetry between the cause and the effect, but it vanishes for large n. Both cases (θ_X < θ_Y and θ_X > θ_Y) can create problems for the estimation, and further investigation is needed to better understand the method's behavior under different tail indexes.

Table 3 The estimation of Γ̂^time(3) is performed on time series generated from Model 3, including the confounded case with no causality between X and Y. The tail indexes θ_X, θ_Y, and θ_Z correspond to the tail index of the respective time series, where a large θ_X signifies lighter tails and a small θ_X denotes heavy tails in ε^X. Each value of Γ̂^time_{X→Y} = · ± · represents the mean of the estimated coefficients Γ̂^time_{X→Y}, along with the difference between this mean and the 95% empirical quantile out of all 500 simulations.
To summarize, the most challenging situation occurs when a heavier-tailed noise variable acts as a common cause. This situation necessitates careful consideration, particularly when both Γ̂^time_{X→Y}, Γ̂^time_{Y→X} ≈ 1: this ambiguity can be attributed either to true bidirectional causality or to the influence of a hidden confounder.

Testing
One approach to establishing a formal methodology for detecting the causal direction between two time series involves the utilization of a threshold. Specifically, we say that X causes Y if and only if Γ̂^time_{X→Y}(p) ≥ τ, where, e.g., τ = 0.9 or 0.95. Our goal is to evaluate the hypothesis H_0: Γ^time_{X→Y}(p) < 1 against the alternative H_A: Γ^time_{X→Y}(p) = 1. This requires an understanding of the distribution of Γ̂^time_{X→Y}(p); however, computing this distribution lies beyond the purview of our current study. We considered estimating confidence intervals using the block bootstrap technique (Section 5.4 in Hesterberg (2014) and the accompanying code in the supplementary package). However, we opted not to incorporate this methodology within the confines of this paper, as its performance in our specific scenario proved to be suboptimal. Future research may reveal a better-working methodology. Additionally, visual aids such as Figures 3 and 4 can serve as valuable tools for enhancing understanding.

Table 4 A comparison of four methods (the Granger test, PCMCI, LPCMCI, and our method, each with n = 500 and n = 5000) for causal inference on two time series models: a simple VAR(2) model and a more complex nonlinear model with a hidden common cause. The percentage obtained after 100 repetitions indicates how many times the estimation was correct.

Subsequently, we consider the rule that infers X → Y if and only if the value of Γ̂^time_{X→Y}(p) meets or exceeds the designated threshold τ, and we provide insight into the process of selecting the appropriate threshold value. This choice should depend on the sample size n and the extremal delay p: opting for a small τ runs the risk of yielding erroneous conclusions, while selecting a large τ can curtail the statistical power. Analogous methodologies are commonly employed, particularly within the realm of extreme value theory, for instance in contexts such as distinguishing between extremal dependence and independence (Haan and Zhou, 2011). To develop an intuitive understanding, we focus on Model 1 with δ = 0.5 and ε^X_t, ε^Y_t iid ∼ Pareto(1, 1). We observe two objectives as a function of τ: the percentage of correctly concluding that X → Y, and the percentage of correctly concluding that Y ̸→ X (that is, concluding that Y does not cause X).
Figures 5 and 6 illustrate the sensitivity of the results to the choice of the threshold τ = τ(n, p) across a range of (n, p) values. It is reasonable to expect that a larger value of p should correspond to a larger value of τ. To ensure that Γ̂^time_{X→Y}(p) < τ occurs in less than 5% of cases, simulations suggest the values τ(500, 6) = 0.90 and τ(500, 11) = 0.93. For larger values p > 11, Γ̂^time_{X→Y}(p) and Γ̂^time_{Y→X}(p) exhibit substantial similarity. It appears that as n increases, the accuracy of Γ̂^time_{Y→X}(p) improves. The 5% significance level seems consistent across all n values, with the main enhancement lying in the power to detect causality.
In conclusion, the appropriate selection of τ exhibits minimal variation with changes in n and p. In all cases, opting for τ ≈ 0.9 appears reasonable, unless p becomes exceedingly large.
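The resulting decision rule is easy to state in code. The sketch below wraps the estimator into the rule "infer X → Y iff Γ̂^time_{X→Y}(p) ≥ τ" and checks, on the hypothetical lag-1 stand-in for Model 1 with Pareto(1, 1) noise, that the rule fires more often in the causal direction than in the reverse one. The simulator's exact coefficients are our assumption, not the paper's display.

```python
import numpy as np

def gamma_time(x, y, p, k=None):
    n = len(x)
    k = k or int(np.sqrt(n))
    f_y = (np.argsort(np.argsort(y)) + 1) / n
    idx = np.argsort(x)[-k:]
    return float(np.mean([f_y[i:min(i + p + 1, n)].max() for i in idx]))

def infer_edges(x, y, p, tau=0.9):
    # Decision rule: infer a directed edge iff the estimated coefficient reaches tau.
    edges = set()
    if gamma_time(x, y, p) >= tau:
        edges.add("X->Y")
    if gamma_time(y, x, p) >= tau:
        edges.add("Y->X")
    return edges

def simulate_model1_pareto(n, delta, rng, burn=100):
    # Hypothetical lag-1 stand-in for Model 1 with Pareto(1, 1) noise.
    x = np.zeros(n + burn); y = np.zeros(n + burn)
    ex = rng.pareto(1.0, n + burn) + 1.0
    ey = rng.pareto(1.0, n + burn) + 1.0
    for t in range(1, n + burn):
        x[t] = 0.5 * x[t - 1] + ex[t]
        y[t] = 0.5 * y[t - 1] + delta * x[t - 1] + ey[t]
    return x[burn:], y[burn:]

rng = np.random.default_rng(5)
hits_xy = hits_yx = 0
for _ in range(15):
    x, y = simulate_model1_pareto(2000, 0.5, rng)
    edges = infer_edges(x, y, p=2, tau=0.9)
    hits_xy += "X->Y" in edges
    hits_yx += "Y->X" in edges
```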

Comparison with state-of-the-art methods
We conduct a comparative analysis of our method with three classical techniques found in the literature. The first is the classical Granger test, utilizing default parameters and a significance level of α = 0.05. The second and third approaches are the PCMCI and LPCMCI methods, both employing default parameters and featuring a robust partial-correlation independence test. For comprehensive insights into their implementations, please refer to the supplementary materials. Our method corresponds to estimating Γ̂^time_{X→Y}(3) and concluding that X causes Y if and only if Γ̂^time_{X→Y}(3) ≥ τ with the choice τ = 0.9.

Fig. 5 The figures depict the performance of our methodology for various τ(500, p) selections in Model 3. The left figure illustrates the percentage of instances where we accurately infer the relation X → Y; the right figure represents the percentage of cases in which we correctly identify the absence of the relation Y ̸→ X. If τ is large, we rarely find the correct relation X → Y, but we almost always correctly find Y ̸→ X; if τ is small, the opposite holds. A red horizontal line marks the 95% success rate. Note that the scenario where p = 1 signifies an erroneous selection of the extremal delay, specifically a situation where p is smaller than the minimal delay.
We examine two distinct models. The first is the VAR Model 1, characterized by δ = 0.5 and noise following a Pareto(1, 1) distribution. Notably, this noise exhibits an infinite expected value, which can potentially pose challenges for classical methodologies. The second is a more complex NAAR model with a hidden confounder, defined as follows.

Model 4. Let (X, Y, Z)^⊤ follow the hNAAR(3, 1) model with independent noise variables ε^X_t, ε^Y_t, 2ε^Z_t ∼ Pareto(1, 1) and f_X(x) = x^{3/4} · 1(x > 50).

The function f_X in Model 4 represents a case where the causality presents itself only in the tail, not in the body of the distribution.
We conduct simulations on the aforementioned time series, considering two data set sizes: n = 500 and n = 5000. Subsequently, employing the four methodologies, we compute the count of simulations (out of 100) where the causal relationships X → Y and Y ̸ → X are accurately inferred. The outcome percentages are presented in Table 4.
The results suggest that state-of-the-art methods behave well within the context of the simple model. Although these methods inherently assume finite expected values, it appears that the Pareto(1, 1) noise distribution does not introduce significant challenges. Our method demonstrates satisfactory performance but exhibits slightly reduced power when compared with other approaches.
On the other hand, state-of-the-art methods encounter difficulties when confronted with the more complex model. PCMCI and LPCMCI frequently miscalculate the directionality, erroneously estimating that Y causes X. This misjudgment arises from the presence of the hidden common cause Z, which creates a spurious effect that even LPCMCI cannot handle well. In contrast, our method successfully discerns genuine causality from causal connections attributed to a shared cause. To conclude, the outcomes presented in Table 4 imply that our method is particularly well suited for large sample sizes, where it outperforms state-of-the-art techniques.

Applications
We apply our methodology to two real datasets. The first one pertains to geomagnetic storms in the field of space weather science, while the second one is related to earth science and explores causal connections between hydrometeorological phenomena. All data and a detailed R code are available in the supplementary package.

Space weather
In the subsequent sections, we address a problem within the realm of space-weather studies. The term "space weather" refers to the variable conditions on the Sun and in space that can influence the performance of technology we use on Earth. Extreme space weather could potentially cause damage to critical infrastructures, especially the electric grid. In order to protect people and systems that might be at risk from space-weather effects, we need to understand the causes of space weather. Geomagnetic storms and substorms serve as indicators of geomagnetic activity. A substorm manifests as a sudden intensification and heightened motion of auroral arcs, potentially leading to magnetic field disturbances in the auroral zones of up to 1000 nT (nanotesla). The origin of this geomagnetic activity lies within the Sun itself. More precisely, a substantial correlation exists between this geomagnetic activity, the solar wind (a flow of charged particles emitted by the Sun), and the interplanetary magnetic field (which constitutes a portion of the solar magnetic field carried outward by the solar wind).
A fundamental challenge in this field revolves around the determination and prediction of specific characteristics: the magnetic storm index (SYM) and the substorm index (AE). It may seem that AE is a driving factor (cause) of SYM because the accumulation of successive substorms usually precedes the occurrence of magnetic storms. However, a recent article (Manshour et al, 2021) suggests otherwise. It proposes that a vertical component of the interplanetary magnetic field (BZ) serves as a shared trigger for both indices. Our approach involves applying our methodology to validate this finding and ascertain whether this causal influence becomes evident in extreme scenarios.
Our dataset comprises three distinct time series (SYM, AE, BZ), encompassing approximately 100 000 measurements, each captured at 5-minute intervals over the entire year 2000. These data, in addition to being available in the supplementary package, can also be accessed online through the NASA website 7 . A visual representation of the data is presented in Figure 7. From the nature of the data, we focus on analyzing extremes characterized by extremely small SYM, extremely large AE, and extremely small BZ (that is, we consider −SYM, +AE, −BZ and compare maxima). Given the inherent characteristics of the data, an appropriate delay will be smaller than p = 24 (2 hours).
First, we discuss whether the assumptions for our method are fulfilled. We estimate the tail indexes of our data 8. The results are the following: SYM has an estimated tail index of 0.25 (0.015, 0.5), AE of 0.18 (0.08, 0.28), and BZ of 0.30 (0.12, 0.46). Therefore, the assumption of regular variation characterized by the same tail index appears reasonable. Furthermore, none of the confidence intervals includes zero, although the one for SYM comes quite close. This observation suggests that our time series can be deemed regularly varying. We also require stationarity of our time series, which seems to hold: it is believed that our variables do not contain a seasonal pattern or a trend (at least not on a horizon of a few years), and the Dickey-Fuller test also suggests stationarity, although it is recognized that tests for stationarity are not very dependable (Dickey, 2005). In summary, all the assumptions seem to be reasonably satisfied.
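The paper estimates tail indexes with the HTailIndex function of the R package ExtremeRisks. As a language-neutral cross-check, the extreme-value index γ can be estimated with the classical Hill estimator, sketched below on a synthetic Pareto sample (the choice k = 500 mirrors the footnote, but the sample here is simulated, not the SYM/AE/BZ series).

```python
import numpy as np

def hill_estimator(sample, k):
    """Hill estimator of the extreme-value index gamma (= 1/theta for RV(theta)
    tails), based on the k largest order statistics."""
    s = np.sort(np.asarray(sample, float))
    tail = s[-k:]                 # the k largest observations
    threshold = s[-k - 1]         # the (k+1)-th largest, used as the threshold
    return float(np.mean(np.log(tail / threshold)))

rng = np.random.default_rng(3)
# Pareto sample with tail index theta = 2, i.e. extreme-value index gamma = 0.5
sample = rng.pareto(2.0, 100_000) + 1.0
gamma_hat = hill_estimator(sample, k=500)
```

On heavy-tailed data the estimate is sensitive to k, which is why the text reports confidence intervals alongside the point estimates.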
Lastly, we proceed to calculate the causal tail coefficient across varying extremal delays p and a range of considered extremes k. The numerical outcomes are depicted in Figure 8. These findings impart the following insights:
• BZ and SYM have strong asymmetry (Γ̂^time_{BZ→SYM}(p) ≈ 1 and Γ̂^time_{SYM→BZ}(p) ≪ 1; see the blue lines).
These results align with the hypothesis that BZ is a common cause of SYM and AE, with no causal relation between them. Note that our methodology has the capability to accommodate scenarios involving a common cause, at least in theory. While conventional methods might suggest a causal relationship from AE to SYM, our analysis reveals instances of extreme events where AE exhibits extremity, yet SYM does not.

7 NASA webpage https://cdaweb.gsfc.nasa.gov, accessed 18.5.2021.
8 We employ the HTailIndex function from the ExtremeRisks package in R with variable k = 500 (Padoan and Stupfler, 2020). The confidence intervals are computed using the normal approximation from Theorem 3.2.5 in Haan and Ferreira (2006) and page 1288 in Drees (2000). Details can be found in the corresponding R documentation.

Hydrometeorology
Numerous significant phenomena have an impact on weather and climate, even though their importance and effects are not yet well understood. Within this application, we focus on a selection of the most intriguing phenomena to explore potential causal relationships among them. By delving into these relationships, we aim to gain insights into the intricate interplay among these variables, contributing to a deeper understanding of their influence on weather and climate dynamics.
• El Niño Southern Oscillation (ENSO) is a climate pattern that describes the unusual warming of surface waters in the eastern tropical Pacific Ocean. It has an impact on ocean temperatures, the speed and strength of ocean currents, the health of coastal fisheries, and local weather from Australia to South America and beyond.
• North Atlantic Oscillation (NAO) is a weather phenomenon over the North Atlantic Ocean consisting of fluctuations in the difference of atmospheric pressure.
• Indian Dipole Index (DMI) (sometimes referred to as the Dipole Mode Index) is commonly measured by an index describing the difference between sea surface temperature anomalies in two specific regions of the tropical Indian Ocean.
• Pacific Decadal Oscillation (PDO) is a recurring pattern of ocean-atmosphere climate variability centred over the mid-latitude Pacific basin.
• East Asian Summer Monsoon Index (EASMI) is defined as an area-averaged seasonally dynamically normalized seasonality at 850 hPa within the East Asian monsoon domain.
• Amount of rainfall in a region in India (AoR) is a data set consisting of the amount of rainfall in a region in the centre of India.

For all six variables, we possess monthly measurements starting from 1.1.1953. Following preliminary data manipulation, such as mitigating seasonality and ensuring data stationarity (for comprehensive specifics, please refer to the supplementary package), we turn our attention to assessing the assumption of heavy-tailedness.
It seems that the assumption of identical tail indexes across all time series does not hold. Employing the same methodology as in the preceding subsection, we calculate the estimated tail indexes along with their corresponding confidence intervals, as presented in Table 5. It is evident that ENSO and AoR exhibit notably larger tail indexes than the other variables. On the other hand, the variables NAO, DMI, PDO, and EASMI appear to share similar tail indexes (with NAO possibly having a slightly smaller one). Still, we proceed, keeping in mind that relations between the first and the second group may be misleading.
Finally, we proceed to compute the causal tail coefficients for each pair of time series. In this application, we conclude that X causes Y if and only if both Γ̂^time_{X→Y}(p) > 0.9 and Γ̂^time_{Y→X}(p) < 0.8. This selection is motivated by the outcomes discussed in Subsection 4.5. Simulation results indicate that if both Γ̂^time_{X→Y}(p) and Γ̂^time_{Y→X}(p) are close to 1, we cannot distinguish between the case where X ↔ Y and the case where a confounder with heavier tails causes both X and Y. In this application, the absence of a confounder with heavier tails cannot be assumed, given the distinct tail indexes observed in the relevant time series. Hence, if both Γ̂^time_{X→Y}(p) and Γ̂^time_{Y→X}(p) are large, we cannot conclude any causal relation; we are therefore only interested in asymmetric extremal behavior. We chose the threshold 0.9 with the same reasoning discussed in Subsection 4.6. A similar line of thought guides the choice of the threshold 0.8: a suitable threshold ought to reside within the interval (0.7, 0.9), given that most coefficients Γ̂^time_{Y→X}(p) from simulations (and the results of this application in Table 6) lie within this interval. Furthermore, the results remain the same if we choose any threshold in the interval (0.77, 0.86), since no arrows in Figure 9 would change. Consequently, the chosen threshold logically fits within this gap in the computed coefficients.
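The pairwise rule of this subsection (draw X → Y iff Γ̂^time_{X→Y}(p) > 0.9 and Γ̂^time_{Y→X}(p) < 0.8) can be written as a small helper over a table of coefficients. The coefficient values below are made-up placeholders, not the entries of Table 6.

```python
def extremal_edges(gamma, tau_hi=0.9, tau_lo=0.8):
    """Directed edges from a dict of pairwise causal tail coefficients.

    gamma[(a, b)] holds the estimate of Gamma^time_{a->b}(p); an edge a -> b is
    drawn iff gamma[(a, b)] > tau_hi and gamma[(b, a)] < tau_lo. If both
    coefficients are large, no conclusion is drawn (possible hidden confounder
    or bidirectional causality).
    """
    edges = []
    for (a, b), g_ab in gamma.items():
        g_ba = gamma[(b, a)]
        if g_ab > tau_hi and g_ba < tau_lo:
            edges.append((a, b))
    return edges

# Hypothetical illustration (these numbers are invented for the sketch):
coeffs = {
    ("ENSO", "EASMI"): 0.95, ("EASMI", "ENSO"): 0.72,
    ("NAO", "DMI"): 0.85,    ("DMI", "NAO"): 0.84,
}
```

Here only the first pair is asymmetric enough, so a single arrow ENSO → EASMI would be drawn; the symmetric second pair yields no conclusion.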
We opt to use an extremal delay of p = 12, equivalent to one year, for all pairs in our analysis. The resulting causal relationships are outlined in Table 6. Subsequently, upon constructing the corresponding graph, we identify specific extremal causal relationships, as depicted in Figure 9. Here, directed arrows indicate the pairs with an apparent asymmetry in the corresponding Γ̂^time(p) coefficients (bold numbers in Table 6). The red arrows correspond to pairs of time series with significantly different tail indexes.
Concluding that the discovered causal relations are surely true would be a very bold statement. These highly complex datasets may not conform to a straightforward time series model, and their tail indexes differ. Additionally, the absence of certain causal relationships does not imply their nonexistence; for instance, a causal connection might exist without exerting significant influence on extreme events. While this methodology does not offer definitive conclusions, it does identify asymmetries in extreme behavior and provides a preliminary notion of directions that might warrant further exploration.
As a means of comparison with existing research in the field of earth science, we offer a selection of pertinent results and articles that tend to concur with our findings. It's important to note that this list is not exhaustive.
• ENSO → EASMI (Pothapakula et al, 2020) (using an information-theory approach); (Chan and Zhou, 2005),
• ENSO → NAO (Mokhov and Smirnov, 2006) (using phase-dynamics modeling, which can give different results than looking at extremes),
• DMI → ENSO (Le et al, 2020),
• DMI → EASMI (Pothapakula et al, 2020) (using an information-theory approach),
• PDO → AoR (Sarkar et al, 2004) (although a rigorous causal methodology was not used),
• PDO → EASMI (Chan and Zhou, 2005) (although a rigorous causal methodology was not used),
• NAO is known to influence air temperature and precipitation in Europe, Northern America, and a part of Asia (Wang et al, 2019). However, whether there are relations between NAO and the other investigated variables is not yet known.

Fig. 9 The estimated causal directions, determined using the causal tail coefficient for time series, involve the following variables: AoR (amount of rainfall in India), DMI (Indian Dipole Index), PDO (Pacific Decadal Oscillation), NAO (North Atlantic Oscillation), EASMI (East Asian Summer Monsoon Index), and ENSO (El Niño Southern Oscillation). The presence of arrows denotes asymmetries in extreme events. In cases where the corresponding tail indexes significantly differ (indicating that the assumptions are not satisfied), the arrow is depicted in red.
Our findings align with the majority of results found in the literature. However, there are three exceptions worth noting: our results suggest DMI → PDO (we couldn't find any reason for this causal relation in the literature). Also, our results do not suggest DMI → ENSO and ENSO → NAO, although these causal relations seem to hold (Mokhov and Smirnov, 2006;Le et al, 2020).

Conclusion
Causal inference in time series presents a recurring challenge within the literature. In numerous real-world scenarios, the causal mechanisms become apparent during extreme periods rather than within the bulk of the distribution. We introduce an innovative method that leverages extreme events for detecting causal relationships. This method is formalized within a mathematical framework, and several of its properties are systematically demonstrated under specific model assumptions.
This work sheds light on some connections between causality and extremes. Many scientific disciplines use causal inference as a baseline of their work. A method that can detect a causal direction in complex heavy-tailed datasets can be very useful in such domains.
This subject offers an extensive array of prospects for future research. For instance, what is the underlying distribution of Γ̂^time_{X→Y}(p)? Could we devise a better (and consistent) statistic that mitigates the negative bias inherent in Γ̂^time_{X→Y}(p)? Is there a means to establish a rigorous testing methodology for the hypothesis Γ^time_{X→Y}(p) = 1? Furthermore, can we incorporate observed covariates into the causal inference between X and Y, moving beyond reliance solely on pairwise discovery? Does this method retain its efficacy even when applied to light-tailed time series? Can a similar approach be adapted to a more generalized framework? These questions point to potential future research and important results.

Conflict of interest and data availability
R code and the data are available in an online repository, or at a request to an author.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

A Auxiliary propositions
We repeatedly use the following property of regularly varying random variables (Breiman, 1965): if Z ∼ RV(θ), then for every α > 0 we have P(Z > αu)/P(Z > u) → α^{−θ} as u → ∞.

Proposition 1. Let X, Y, (ε_i, i ∈ N) be independent continuous random variables with support on some neighborhood of infinity. Let a_i, b_i ≥ 0, i ∈ N, and λ_1, λ_2 ∈ R be constants. Then the corresponding tail equivalence holds, provided that the sums are a.s. summable and non-trivial.
Proposition 2. For i = 0, 1, 2, . . . , let

for all λ ∈ R. We consider only those λ ∈ R such that P(Σ_{i=0}^∞ a_i ε^X_i < λ) > 0; otherwise, the statement is trivial. Note that the first condition in Proposition 2 implies that all sums are a.s. summable (Mikosch and Samorodnitsky, 2000). We prove this proposition using the following series of lemmas.

Lemma 4. Let X, Y ∼ RV(θ) be independent random variables. Then, for every λ ∈ R,

Lemma 5. Under the conditions from Proposition 2, for all n ∈ N,

Lemma 6. Let Z ∼ RV(θ) be a random variable independent of (ε^X_i, i ∈ Z). Under the conditions from Proposition 2, for all n ∈ N,
Proof of Lemma 4. Using Bayes' theorem, we obtain
For the denominator, we use the sum-equivalence P(X + Y > u) ∼ P(X > u) + P(Y > u). It is therefore sufficient to show that P(X + Y > u | X < λ) ∼ P(Y > u). Let W be a random variable independent of Y with a distribution satisfying P(W ≤ t) = P(X ≤ t | X < λ) for all t ∈ R. Then, P(X + Y > u | X < λ) = P(W + Y > u). We obviously have lim_{u→∞} P(W > u)/P(Y > u) = 0, and we can use, e.g., Theorem 2.1 in Bingham et al (2006) to obtain lim_{u→∞} P(W + Y > u)/P(Y > u) = 1. Therefore, P(X + Y > u | X < λ) ∼ P(Y > u), which is what we wanted to prove.
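The sum-equivalence used here reflects the "single big jump" behaviour of regularly varying variables: a large sum is almost always driven by exactly one large term. A Monte-Carlo sketch (illustrative only; the tail index, sample size, and thresholds are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, u = 1.5, 1_000_000, 200.0

# two independent Pareto(theta) samples, P(X > z) = z**(-theta) for z >= 1
X = rng.pareto(theta, n) + 1.0
Y = rng.pareto(theta, n) + 1.0

# sum-equivalence: P(X + Y > u) ~ P(X > u) + P(Y > u) for large u
lhs = np.mean(X + Y > u)
rhs = np.mean(X > u) + np.mean(Y > u)

# single big jump: given a large sum, one term carries almost all of it
big = X + Y > u
frac_one_large = np.mean(np.maximum(X[big], Y[big]) > 0.9 * u)
print(lhs / rhs, frac_one_large)  # both close to 1
```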
Proof of Lemma 5. Without loss of generality, Φ = ∅; otherwise, we simply work with a smaller n. Denote ω = min_{i≤n} a_i; it holds that ω > 0. In this proof only, we denote B = Σ_{i=0}^n b_i and A = Σ_{i=0}^n a_i. The following relations between events are valid: (simply put, there needs to be one large and one small ε^X_·). Therefore, we can rewrite

Proof of Lemma 6. Without loss of generality, Φ = ∅; otherwise, we simply work with a smaller n. The proof is very similar to that of Proposition 1. In this proof only, we denote B = Σ_{i=0}^n b_i and A = Σ_{i=0}^n a_i. Let W be a random variable independent of Z with a distribution satisfying P(W ≤ t) = P( for all t ∈ R. Then, P(X + Y > u | X < λ) = P(W + Y > u). Using Bayes' theorem, we have
In the last equality, we used the fact that W does not have a heavier tail than Z and therefore we can use the sum-equivalence.
All we need to prove is that lim_{u→∞} P(W > u)/P(Σ_{i=0}^n b_i ε^X_i > u) = 0. Again, using Bayes' theorem, we obtain The rest follows from Lemma 5.
Proof of Proposition 2. Let δ > 0, define ζ = 1 − √(1 − δ) > 0, and choose n_0 ∈ N large enough that the following conditions hold: Then E, F, Z, and also G, H, Z, are pairwise independent. Sum-max equivalence gives us P( . Hence, with our notation, we want to show the following. First, due to Lemma 4, Second, using the previous results and the independence of F and (G, Z), we obtain Finally, we obtain (the first inequality is trivial; the second uses the identity P(A ∩ B) ≥ P(A) − P(B^c); the third uses the previous result; the equality follows from Lemma 6; the next two inequalities follow from the sum-equivalence and the trivial bound P(H > u) ≥ 0; the next is trivial; and the last inequality follows from P(F + δ > 0) > 1 − δ and the independence of E and F): Sending δ → 0, we finally obtain the desired inequality. The inequality in the opposite direction can be shown analogously.
Consequence 2. Under the conditions from Proposition 2,

Proof. The proof is analogous to that of Proposition 2. The modified Lemma 4 and Lemma 6 remain valid, with |ε^X_i| in place of ε^X_i in the equations. The modification of Lemma 5 is trivial, because

The limiting argument for n → ∞ remains the same.
We have used the sum-equivalence, independence, and the previous equation. Therefore, we have lim_{u→∞} P(X_t > u)

Proof of Proposition 3. Find c < 1 and K ∈ R such that for all x > 0 we have f(x) < K + cx, g_1(x) < K + cx, and g_2(x) < K + cx.
Finally, because the X_i and the ε^Y_j are all independent, we obtain where we used the regular variation property, the sum-equivalence, and Lemma 7.
Remark. We have proved a stronger claim: for every hNAAR(q, θ) model, there exists a stable VAR(q) sequence that is a.s. larger. Note that the VAR(q) process defined by with 0 < a, b, d < 1 is stable.

B Proofs
Observation: Let X, Y be continuous random variables with support on some neighborhood of infinity, and let F_X, F_Y be their distribution functions. Then, if and only if lim_{v→∞} P(Y > λ | X > v) = 1 for every λ ∈ R.

Theorem 1. Let (X, Y)^⊤ be a bivariate time series that follows either the hVAR(q, θ) model or the hNAAR(q, θ) model. If X causes Y, then Γ time X→Y (q) = 1.
Proof for the hVAR(q, θ) model. Since X causes Y, we have δ_r > 0 for some r ≤ q. Then Now, using the causal representation, we can write where we used Proposition 1 in the last step. Therefore, lim_{v→∞} P(Y_r > λ | X_0 > v) ≥ 1, and since this is a probability, it equals 1, which proves the theorem.
Proof for the hNAAR(q, θ) model. We proceed very similarly as in the proof for the hVAR(q, θ) model. We rewrite Γ time X→Y (q). Since X causes Y, it holds that g_2 is not constant and lim_{x→∞} g_2(x) = ∞. This implies that for every λ ∈ R there exists x_0 ∈ R such that g_2(x) > λ for all x > x_0. Therefore, for all u > x_0, Finally, we only use the fact that ε^Y_t and g_1 are non-negative. Hence, which is what we wanted to prove.
Theorem 2. Let (X, Y) ⊤ be a bivariate time series which follows either the hVAR(q, θ) model or the hNAAR(q, θ) model. If Y does not cause X, then Γ time Y→X (p) < 1 for all p ∈ N.
Proof for the hVAR(q, θ) model. Let λ ∈ R be such that P(X_0 < λ) > 0. We show that Now, we use the causal representation of the time series, which, since Y does not cause X, can be written in the form for ϕ_i = a_i + · · · + a_{i−p} (we define a_j = 0 for j < 0). Finally, it follows from the consequence of Proposition 2 that which is what we wanted to prove. (Proposition 2 requires non-trivial sums; however, if d_i = 0 for all i, then the series are independent and the inequality holds trivially.)
Proof for the hNAAR(q, θ) model. We have Choose λ ∈ R large enough that sup_{x<λ} f(x) < λ and such that Then, as in the proof for the heavy-tailed VAR model, if we show that this quantity is strictly larger than 0, it will imply that Γ time Y→X (q) < 1. We know that for every i ≥ 1 the following holds It remains to show, for the case i = 0, that lim_{v→∞} P(X_0 > λ | Y_0 > v) < 1. Let Z = g_1(Y_{−1}) + g_2(X_{−q}); then Z is independent of ε^Y_0. After rewriting, we obtain
On the other hand, if lim_{v→∞} P(Z > v)/P(ε^Y_0 > v) = c ∈ R_+, we also have that Z ∼ RV(θ) (this follows directly from the definition of regular variation: the tail behaviour is the same up to a constant). Therefore, we can apply the sum-equivalence and we obtain which is less than 1 for δ close enough to 1. Therefore, we obtain lim_{v→∞} P(X_0 > λ | Y_0 > v) < 1, which is what we wanted to prove.
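To illustrate the asymmetry established by Theorems 1 and 2, the following sketch simulates an hVAR(1)-type series in which X causes Y and evaluates a naive sample version of the causal tail coefficient. The estimator form (average of max_{i≤p} F̂_Y(Y_{t+i}) over the k largest values of X_t), the coefficients, and the choice of k are simplifying assumptions, not the paper's exact definitions:

```python
import numpy as np

rng = np.random.default_rng(2)

def gamma_time(x, y, p, k):
    """Naive sample causal tail coefficient: average of max_{i<=p} F_Y(Y_{t+i})
    over the k indices where x is largest (a simplified sketch)."""
    n = len(x)
    Fy = np.argsort(np.argsort(y)) / n      # empirical CDF of y evaluated at y_t
    idx = np.argsort(x)[-k:]                # indices of the k largest x values
    idx = idx[idx < n - p]                  # keep future lags inside the sample
    return np.mean([Fy[t:t + p + 1].max() for t in idx])

# hVAR(1)-type simulation with Pareto(1.5) noise: X causes Y, not vice versa
n, theta = 100_000, 1.5
eX = rng.pareto(theta, n) + 1.0
eY = rng.pareto(theta, n) + 1.0
X, Y = np.zeros(n), np.zeros(n)
for t in range(1, n):
    X[t] = 0.5 * X[t - 1] + eX[t]
    Y[t] = 0.5 * Y[t - 1] + 0.5 * X[t - 1] + eY[t]

k = int(n ** 0.4)
g_xy = gamma_time(X, Y, p=1, k=k)   # close to 1, in line with Theorem 1
g_yx = gamma_time(Y, X, p=1, k=k)   # bounded away from 1, in line with Theorem 2
print(g_xy, g_yx)
```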
Lemma 1. The extremal causal condition holds in the hVAR(q, θ) model (i.e., where the coefficients are non-negative) when X causes Y.
Proof. In the notation of Definition 2 and Theorem 2, if δ_p > 0, then

Theorem 3. Let (X, Y)^⊤ be a time series following the hVAR(q, θ) model, with possibly negative coefficients, satisfying the extremal causal condition. Moreover, let ε^X_t, ε^Y_t have full support on R and |ε^X_t|, |ε^Y_t| ∼ RV(θ). If X causes Y, but Y does not cause X, then Γ time |X|→|Y| (q) = 1 and Γ time |Y|→|X| (q) < 1.
Proof. First, we show that if Y does not cause X, then Γ time |Y|→|X| (q) < 1. This holds even without the extremal causal condition. Similarly as in the proof of Theorem 2, it is sufficient to show that ∃λ > 0 : We use the following fact. Since we assumed that ε^X_i and |ε^X_i| are RV(θ), the following holds: see, e.g., (Jessen and Mikosch, 2006, Lemma 3.5). The second step follows simply from the max-sum equivalence. Finally, we use this fact and the triangle inequality: This is, for v → ∞, less than 1 due to the classical non-negative case from Proposition 2 (for any λ ∈ R such that P(|Σ_{i=0}^∞ a_i ε^X_{t−i}| > λ) < 1).

Second, we show that if X causes Y, then Γ time |X|→|Y| (q) = 1. Similarly as in the proof of Theorem 1, it is sufficient to show that lim_{v→∞} P(|Y_r| < λ | |X_0| > v) = 0 for every λ ∈ R. Here, r ≤ q is some index with δ_r ≠ 0. Using the causal representation with the same notation as in the proof of Theorem 1, where we used the same argument as in the first part of the proof. Hence, we have reduced our model to the classical non-negative case, and the result follows from the previous theory. Using Lemma 5, we obtain the result for finite n, because Φ = ∅ due to the extremal causal condition. The argument for the limiting case n → ∞ follows the same steps as in the proof of Proposition 2.
Theorem 4. Let (X, Y, Z)^⊤ = ((X_t, Y_t, Z_t)^⊤, t ∈ Z) follow the three-dimensional stable VAR(q) model with iid regularly varying noise variables. Let Z be a common cause of both X and Y, and let neither X nor Y cause Z. If Y does not cause X, then Γ time Y→X (p) < 1 for all p ∈ N.
Proof. Let our series have the following representation: Just as in the proof of Theorem 2, it is sufficient to show that lim v→∞ P(X p > λ|Y 0 > v) < 1 for some λ > 0. After rewriting, which follows from Proposition 2 (two countable sums can be written as one countable sum).
Lemma 2. Let (X, Y) ⊤ follow the hVAR(q, θ) model, where X causes Y. Let s be the minimal delay. Then, Γ time X→Y (r) < 1 for all r < s, and Γ time X→Y (r) = 1 for all r ≥ s.
Proof. Proving that Γ time X→Y (r) = 1 for all r ≥ s is a trivial consequence of the proof of Theorem 1 (in the first line of the proof, instead of choosing some s ≤ q with δ_s > 0, we choose s to be the minimal delay).
Concerning the first part, it suffices to prove that Γ time X→Y (s − 1) < 1, because then also Γ time X→Y (s − i) ≤ Γ time X→Y (s − 1) < 1 for i ≥ 1. As in the proof of Theorem 2, we only need to show that lim_{v→∞} P(Y_{s−1} < λ | X_0 > v) > 0 for some λ > 0. Rewriting into the causal representation, we obtain the following relation It remains to realize that d_i = 0 for i ∈ {1, . . . , s − 1}, because s is the minimal delay. Therefore, ε^X_0 is independent of Y_{s−1}, and the rest follows from Proposition 2 (where we treat the two sums as one, and the single ε^X_0 plays the role of the second "sum").
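The jump of Γ time X→Y (r) at the minimal delay s suggests a simple lag estimator: compute the coefficient for increasing r and locate the jump towards 1. A hedged simulation sketch, where the naive estimator form is a simplifying assumption and X is taken iid for clarity:

```python
import numpy as np

rng = np.random.default_rng(3)

def gamma_time(x, y, p, k):
    # naive coefficient: average of max_{i<=p} F_Y(Y_{t+i}) over the
    # k largest values of x (a simplified sketch, not the exact estimator)
    n = len(x)
    Fy = np.argsort(np.argsort(y)) / n
    idx = np.argsort(x)[-k:]
    idx = idx[idx < n - p]
    return np.mean([Fy[t:t + p + 1].max() for t in idx])

n, theta, s = 100_000, 1.5, 3       # s is the minimal delay from X to Y
X = rng.pareto(theta, n) + 1.0      # X taken iid here for clarity
eY = rng.pareto(theta, n) + 1.0
Y = np.zeros(n)
for t in range(s, n):
    Y[t] = 0.5 * Y[t - 1] + 0.5 * X[t - s] + eY[t]

k = int(n ** 0.4)
gammas = [gamma_time(X, Y, p=r, k=k) for r in range(6)]
# the sequence stays below 1 for r < s and jumps close to 1 at r = s
print([round(g, 2) for g in gammas])
```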
Theorem 5. Let (X, Y)^⊤ = ((X_t, Y_t)^⊤, t ∈ Z) be a stationary bivariate time series whose marginal distributions are absolutely continuous with support on some neighborhood of infinity. Let Γ time X→Y (p) exist, and let k_n satisfy (5) and

(n/k_n) P( (n/k_n) sup_{x∈R} |F̂_X(x) − F_X(x)| > δ ) → 0 as n → ∞, for all δ > 0.
Then, E Γ̂ time X→Y (p) → Γ time X→Y (p) as n → ∞.

Proof. Throughout the proof, we use the fact that for a continuous X_1 we always have P(F_X(X_1) ≤ t) = t for t ∈ [0, 1], and the fact that P(F̂_X(X_1) ≤ k/n) = P(X_1 ≤ X_(k)) = k/n for k ≤ n, k ∈ N, which follows from stationarity. Please note that X_(k) is always meant with respect to the (suppressed) index n.
Note that these two equations differ only in the first term. Therefore, to show (9), we only need to show that

(n/k_n) ∫_Ω Z · 1(X_(n−k_n) > X_1 > u_n) dP − (n/k_n) ∫_Ω Z · 1(u_n > X_1 > X_(n−k_n)) dP → 0 as n → ∞.
We show that the first term goes to 0. Analogously, the second term can be shown to converge to 0.
Proof. The result is a slight modification of Proposition 13 in Chen and Wu (2018). The proposition states that, under our conditions, there exist α > 1/2 and n_0 ∈ N such that for all n > n_0 and all z ≥ γ √n (log n)^α,

P( n sup_{x∈R} |F̂_X(x) − F_X(x)| / f⋆ > z ) ≤ constant · (n / z^{qβ}) (log z)^{r_0}.
Choose δ > 0 and take z = f⋆ δ k_n. Note that z ≥ γ √n (log n)^α for large n due to (7). Rewrite (10) as

(n/k_n) P( (n/k_n) sup_{x∈R} |F̂_X(x) − F_X(x)| > δ ) ≤ constant · (n / k_n^{qβ}) (log(f⋆ δ k_n))^{r_0} (n/k_n).
Since (n / k_n^{qβ}) · (n / k_n) → 0 as n → ∞ due to (7), we obtain that condition (6) is satisfied.
Claim 1. Under the setup in Section 4.5, where θ_Z ≥ θ_X, θ_Y, the following implications hold:
• Y does not cause X and θ_X ≥ θ_Y =⇒ Γ time Y→X (p) < 1 for all p ∈ N.
• X causes Y and θ_X, θ_Y > 0 =⇒ Γ time X→Y (q) = 1.
On the other hand, if θ_X < θ_Y, then Γ time Y→X (q) = 1 can occur even if Y does not cause X. Hence, the results from Subsection 2.2 remain valid as long as θ_X ≥ θ_Y, while they may no longer hold if θ_X < θ_Y.
We do not provide a rigorous proof of this claim; however, we give some intuition for why we believe it is true. Γ time Y→X (q) will be even smaller than in the case θ_X = θ_Y, since the effect of X on Y in the extremes is much weaker when X has lighter tails than Y. To make this more rigorous, one can rewrite Proposition 2 with unequal tail indices and claim that for all λ ∈ R, where the notation follows Proposition 2. The proof of Theorem 2 would then follow the same steps with the modified Proposition 2. As for the second bullet point, the proof of Theorem 1 does not use the regular variation condition. Consequently, if X → Y, we deduce that Γ time X→Y (q) = 1, irrespective of the tail indices.
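The caveat for θ_X < θ_Y can be seen in simulation: when the cause X has much heavier tails, the extremes of Y are inherited from X, and a naive coefficient estimate approaches 1 in both directions. In this sketch, the estimator form, the coefficients, and the choice of k are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def gamma_time(x, y, p, k):
    # naive coefficient: average of max_{i<=p} F_Y(Y_{t+i}) over the
    # k largest values of x (a simplified sketch, not the exact estimator)
    n = len(x)
    Fy = np.argsort(np.argsort(y)) / n
    idx = np.argsort(x)[-k:]
    idx = idx[idx < n - p]
    return np.mean([Fy[t:t + p + 1].max() for t in idx])

# theta_X = 1 < theta_Y = 3: the cause X has much heavier tails than Y's noise
n = 100_000
eX = rng.pareto(1.0, n) + 1.0
eY = rng.pareto(3.0, n) + 1.0
X, Y = np.zeros(n), np.zeros(n)
for t in range(1, n):
    X[t] = 0.5 * X[t - 1] + eX[t]
    Y[t] = 0.5 * Y[t - 1] + 0.5 * X[t - 1] + eY[t]

k = int(n ** 0.4)
g_xy = gamma_time(X, Y, p=1, k=k)
g_yx = gamma_time(Y, X, p=1, k=k)
# both values come out close to 1: the extremes of Y are inherited from X,
# so the causal direction can no longer be read off the coefficients
print(g_xy, g_yx)
```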