Importance Sampling of Rare Events in Chaotic Systems

Finding and sampling rare trajectories in dynamical systems is a difficult computational task underlying numerous problems and applications. In this paper we show how to construct Metropolis-Hastings Monte Carlo methods that can efficiently sample rare trajectories in the (extremely rough) phase space of chaotic systems. As examples of our general framework we compute the distribution of finite-time Lyapunov exponents (in different chaotic maps) and the distribution of escape times (in transient-chaos problems). Our methods sample exponentially rare states with a polynomial number of samples (in both low- and high-dimensional systems). Open-source software that implements our algorithms and reproduces our results can be found at https://github.com/jorgecarleitao/chaospp


I. INTRODUCTION
I. INTRODUCTION
Extreme events play a crucial role in our society. Landslides, floods, meteorite collisions, solar flares, and earthquakes are all events that are rare but often lead to catastrophic consequences for our well-being [2]. Science often studies extreme events by recreating the process that generates them sufficiently many times. While in some cases the process can be reproduced experimentally, in many others (e.g., astronomical and weather events) the best we can do is to simulate physical models.
Physical models often contain non-linearities in the equations of motion that amplify fluctuations and give rise to extreme events. A famous effect of non-linearity is chaos, the extremely sensitive dependence on initial conditions. Chaos has a fascinating history dating back to the seminal work of Poincaré in 1890 [3] and is today a well-established field of research with applications in Biology, Geology, Economics, Chemistry, and Physics [4,5]. Chaotic dynamics often hinders our ability to study the evolution of the system analytically, e.g. it forbids describing trajectories in closed form. In this situation, again, often the best we can do is to numerically simulate the model on a computer, a paradigm popularized by Lorenz since the 1960s [4,6]. Extreme events are then studied statistically, over an ensemble of initial conditions.
In this paper we introduce a framework for performing numerical simulations of chaotic systems in such a way that rare trajectories are generated more often than they would be by chance. This is an importance-sampling [7,8] (or rare-event-simulation [9]) strategy that builds on methods that have proven successful in the characterization of rare configurations in various problems. The distinguishing feature of our framework is that it considers general classes of observables [10] in deterministic chaotic [5,11] systems, and is therefore able to find and sample initial conditions leading to extreme events in different problems. We apply our framework to two traditional problems in the numerical exploration of chaotic systems:

• trajectories with high or low finite-time Lyapunov exponents;
• long-living trajectories in open systems.
The results shown in this paper are general, but simulations are done in simple dynamical systems (time-discrete maps with up to 16 dimensions). Our goal is to illustrate the generic computational challenges of sampling rare events in chaotic systems, in line with the tradition that simple systems often possess the basic mechanisms responsible for extreme events. The importance of rare trajectories in chaotic systems, including methods designed to find and sample them, has been studied from different perspectives [10-20]. For example, the stagger-and-step method, used to find long-living trajectories of transiently chaotic systems, has been an important tool to characterize chaotic saddles [11,12]. Lyapunov weighted dynamics is a population Monte Carlo method that was successfully applied to find trajectories with atypically low or high chaoticity [16,21]. A similar genealogical particle analysis was applied to compute rare events in simple climate models [20]. Another powerful and widely used trajectory-sampling method is transition path sampling [14,15], which has been used to sample rare trajectories (e.g. chemical reactions), typically influenced by thermal noise [15]. These are typically trajectories that transit from one stable configuration to another [15,22]. These different methods achieve their goal through different, often ingenious, solutions.
Here we aim to construct methods that are applicable to different classes of problems and to quantitatively understand the impact of different parameters and choices on the efficiency of the algorithm. We then show that some of the existing methods can be derived from our construction under suitable approximations.
Our framework relies on Metropolis-Hastings (MH) importance sampling [7,8], a well-established numerical technique that has been used in statistical physics to study rare events since the 1950s. MH produces a random walk x → x' in the phase space of the system that generates initial conditions leading to extreme events (rare states) more often than they would be found by chance, consequently reducing the computational cost associated with their rareness. The flexibility of MH is confirmed by its success in numerous fields in Physics, Chemistry, and Finance, among many others [7,8]. While transition path sampling [15], Lyapunov weighted dynamics [16], and genealogical methods [20] already use importance sampling in deterministic chaotic systems, three fundamental questions remain largely open: 1. can MH be used to systematically sample rare trajectories of (deterministic) chaotic systems? If yes, 2. how, and 3. at what (computational) cost?
To answer these questions, we develop a systematic approach to apply MH in chaotic systems. The crucial step is the choice of the proposal distribution of the MH algorithm: the conditional probability of "trying" a state x' in the phase space of the system, given the current state x. The question we have to answer in order for MH to work is: what proposal distribution guarantees that an observable of the trajectory starting at x', e.g. its Lyapunov exponent, is similar to the same observable of the trajectory starting at x? Our main contribution is a methodology to answer this question for a broad class of chaotic systems and observables. More specifically, we show that incorporating properties of trajectories of chaotic systems in the proposal is a necessary condition to obtain an efficient MH algorithm. This methodology allows us to construct efficient MH algorithms to sample rare events in different problems and classes of chaotic systems. We expect the ideas and formalism presented here to find applications in other problems of chaotic systems and in the study of extreme events more generally. Therefore, we expect our results to be useful both to those studying extreme events in non-linear systems and to those studying numerical techniques.
The paper is organized as follows: Sec. II reviews traditional numerical problems in chaotic systems and shows how they can be formulated as problems of sampling rare trajectories; Sec. III introduces the MH algorithm as a method to perform importance-sampling simulations in chaotic systems, and shows that naive approaches do not lead to an efficient MH; Sec. IV shows how to incorporate general features of chaotic systems, such as exponential divergence or self-similarity of some of their properties, in the proposal distribution in order to achieve an efficient MH; Sec. V presents numerical tests of the general framework that confirm the applicability of Monte Carlo algorithms to sample rare events in different classes of chaotic systems; Sec. VI summarizes our results and discusses their implications.

II. REVIEW OF PROBLEMS
We consider dynamical systems whose states x in a phase space Ω, x ∈ Ω ⊂ R^D, evolve in time from an initial condition x = x_0 according to

x_{t+1} = F(x_t), i.e. x_t = F^t(x_0),   (1)

where F(x) ∈ Ω and F^t is F composed t times, F^t(x) ≡ F(F(...F(x)...)). Such discrete-time systems can be obtained from continuous-time systems through a Poincaré surface of section, a stroboscopic projection, or by the time discretization of the numerical integration of differential equations. We are also interested in the dynamics in the tangent space, which quantifies the divergence between two initial conditions [5]. Specifically, the distance of a state x' displaced from x by h, x' = x + h, to the original state x evolves in time according to

x'_t − x_t = F^t(x + h) − F^t(x).   (2)

Expanding F^t(x') around F^t(x) allows x'_t − x_t to be written as

x'_t − x_t = J_t(x) h + (1/2) h^T [∂²F^t(x)/(∂x_i ∂x_j)] h + O(|h|³),   (3)

where J_t(x) ≡ dF^t(x)/dx is the Jacobian matrix of F^t, and ∂²F^t(x)/(∂x_i ∂x_j) is the (i, j) entry of the Hessian matrix of F^t. The first term of Eq. 3 can be expanded using the derivative of the composition (chain rule) and written as

J_t(x) = J(x_{t−1}) J(x_{t−2}) · · · J(x_0),   (4)

where J ≡ J_1 and h_t evolves in the tangent space according to

h_{t+1} = J(x_t) h_t.   (5)

For small |h|, the growth of |x'_t − x_t| is characterized by the eigenvalues and eigenvectors of J_t [23]. The largest finite-time Lyapunov exponent (FTLE) of a point x can be defined [24] as

λ_t(x) ≡ (1/t) ln µ_1,   (6)

where µ_1 is the real part of the largest eigenvalue of J_t. Thus, when at least one direction is unstable (µ_1 > 1), |x'_t − x_t| increases exponentially with time, and at most by

|x'_t − x_t| ≤ |h| e^{λ_t(x) t}.   (7)

When the system is one dimensional, the "Jacobian matrix" is a single number, the product in Eq. 4 is a product of numbers, and the only "eigenvalue" is the result of this product. Thus, in this case Eq. 6 can be written as

λ_t(x) = (1/t) Σ_{i=0}^{t−1} ln |F'(x_i)|.   (8)

We now describe two computational problems in chaotic dynamical systems.
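As a concrete illustration of Eq. 8, the following Python sketch (ours, not part of the paper's chaospp package; the map and parameters are illustrative choices) computes the FTLE of a one-dimensional map by accumulating ln|F'(x_i)| along the trajectory:

```python
import math

def ftle_1d(f, dfdx, x0, t):
    """Largest finite-time Lyapunov exponent of a 1-D map (Eq. 8):
    lambda_t(x0) = (1/t) * sum_{i=0}^{t-1} ln |F'(x_i)|."""
    x, acc = x0, 0.0
    for _ in range(t):
        acc += math.log(abs(dfdx(x)))
        x = f(x)
    return acc / t

# The tent map F(x) = 1 - |1 - 2x| has |F'(x)| = 2 everywhere, so every
# initial condition gives lambda_t = ln 2 exactly, for any t.
tent = lambda x: 1.0 - abs(1.0 - 2.0 * x)
dtent = lambda x: 2.0 if x < 0.5 else -2.0
lam = ftle_1d(tent, dtent, x0=0.2342, t=1_000)   # -> ln 2 = 0.693...
```

For maps with non-constant |F'(x)|, such as the logistic map, `ftle_1d` returns a value that fluctuates with x0 at finite t, which is exactly the variability studied below.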
A. Variability of trajectories' chaoticity

For a chaotic system, λ_L ≡ λ_{t→∞} > 0 in Eq. 6. The variation of the (maximum) finite-time Lyapunov exponent λ_t (FTLE) across different initial conditions characterizes the variation of chaoticity of the system, yielding a distribution of the FTLE, P(λ_{t_o}), or, equivalently, the distribution of E ≡ t_o λ_{t_o}:

P(λ_{t_o}) = ∫_Ω U(x) δ(λ_{t_o} − λ_{t_o}(x)) dx,   (9)

where U(x) is a chosen probability distribution (e.g. uniform in the phase space). The FTLE and its distribution were introduced in the 80s [25-27] and have been used to study turbulent flows [28], Hamiltonian dynamics [29], and chimera states [30], and to characterize dynamical trapping [25,31-34], among others [35,36]. The distribution of the FTLE is related to the generalized dimensions [37] and often follows a large-deviation principle, with t_o as the extensive parameter [18,37]. In strongly chaotic systems the distribution of the FTLE is Gaussian [37], whereas for intermittent chaos and other weakly chaotic systems it is typically non-Gaussian [38]. Figure 1 shows the phase-space dependency and distribution of the FTLE in one chaotic system. It contains characteristic features observed in strongly chaotic systems: λ_t depends sensitively on the state x, and its distribution P(λ_t) decays exponentially to zero on both sides. In the limit t_o → ∞, P(λ_{t_o}) → δ(λ_{t_o} − λ_L).
The tails of the distribution P(E) play a significant role in the characterization of chaotic systems, as, for example, the higher moments of the distribution are related to higher qs in the generalized dimensions D_q of the attractor [37]. Furthermore, the regions of the phase space with small (large) finite-time Lyapunov exponent are associated with slow (fast) decay of correlations [27], and their characterization has been used to gain insight into whether the system is ergodic or not [29]. Moreover, trajectories characterized by a low or high finite-time Lyapunov exponent can play a significant role in the dynamics of interfaces in chaotic flows [28] and in other settings [16].
A typical analysis of the FTLE is to measure how a quantity, W(x), depends on the FTLE λ_{t_o} [27-29]. This requires estimating an integral of the form

W(λ_{t_o}) = ∫_Γ W(x) U(x) δ(λ_{t_o} − λ_{t_o}(x)) dx / ∫_Γ U(x) δ(λ_{t_o} − λ_{t_o}(x)) dx,   (10)

where W(x) is the pre-selected quantity, Γ is a pre-selected sampling region (often Γ = Ω), and U(x) is the weight attributed to x (often uniform, U(x) = 1/|Γ|). Let us look at two examples. First, consider the problem of estimating the distribution P(λ_{t_o}), e.g. Refs. [16,27,29,38]. The traditional technique is to sample M states x according to U(x) = 1/|Γ| and estimate P(λ_{t_o}) using the estimator M_λ/M, where M_λ is the number of samples with λ_{t_o}(x) ∈ [λ_{t_o}, λ_{t_o} + ∆λ_{t_o}]. Formally, this corresponds to W(λ_{t_o}) with W(x) = 1. The second example is retrieved from Ref. [27]. There, in order to evaluate the contribution of the algebraic region of the phase space to the power spectra, the authors decomposed it into two terms corresponding to the power spectra of trajectories with low FTLE and trajectories with high FTLE. This required estimating the power spectra from a set of trajectories conditioned to an interval of λs on the tails of the distribution. Associating the power spectrum at frequency f of the trajectory starting at x with a quantity W_f(x), the power spectra represented in Fig. 4 of Ref. [27] correspond, for different frequencies f, to integrals of the form of Eq. 10 restricted to the relevant interval of λ_{t_o}.
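The traditional uniform-sampling estimator M_λ/M can be sketched as follows (an illustrative Python fragment of ours; the logistic map, t_o = 30, the sample size, and the bin width are our own assumptions, not choices made in the paper):

```python
import math, random

def ftle_logistic(x, t):
    """Eq. 8 for the logistic map F(x) = 4x(1-x); the tiny epsilon guards the
    log against the measure-zero event x = 1/2, where F'(x) = 0."""
    acc = 0.0
    for _ in range(t):
        acc += math.log(abs(4.0 - 8.0 * x) + 1e-300)
        x = 4.0 * x * (1.0 - x)
    return acc / t

random.seed(1)
t_o, M, dlam = 30, 50_000, 0.05
counts, total = {}, 0.0
for _ in range(M):                 # M uniform samples, U(x) = 1/|Gamma|
    lam = ftle_logistic(random.random(), t_o)
    total += lam
    b = math.floor(lam / dlam)     # histogram bin of width dlam
    counts[b] = counts.get(b, 0) + 1
mean_lam = total / M
# P(lambda_{t_o}) is estimated by M_lambda / M per bin; the histogram is
# peaked near lambda_L = ln 2 and both tails decay exponentially.
```

The exponential decay of the tails is precisely what makes this uniform-sampling approach expensive: the bins far from the peak receive exponentially few of the M samples.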
Numerically estimating the integral in Eq. 10 is challenging for two reasons. First, P(λ_{t_o}) decays with t_o (lower panel of Fig. 1): the distribution of FTLE often follows a large-deviation principle, P(λ_{t_o}(x)) ∝ exp(t_o s(λ_{t_o})), where s is intensive with respect to t_o and is often concave [37]. Consequently, the traditional methodology of sampling states uniformly to find or sample states with increasing t_o requires an exponentially large number of initial conditions. Second, the dependency of λ_{t_o}(x) on x has multiple local minima and maxima (upper panel of Fig. 1). Such rough (fractal [39]) landscapes are known to challenge numerical techniques (e.g., simulations get trapped in local minima or maxima) [40].
The problem of finding and sampling states with high or low λ has been addressed in the literature with numerical techniques that go beyond traditional uniform sampling [16-18]. Such techniques have been successfully applied to find [16] and sample [17,18] states with extremal λs in different chaotic systems. Refs. [16,18] use a population Monte Carlo canonical ensemble where λ_t plays the role of the energy E to find or sample states with high or low FTLE. The method computes stochastic trajectories that, from the numerical tests performed, are indistinguishable from (deterministic) trajectories of the system. Ref. [17] proposes a flat-histogram simulation to find states of high or low chaoticity by developing an observable to quantify the chaoticity of the state.

B. Transient chaos
The best known examples of chaotic systems have a fractal attractor in the phase space [5], the Lorenz attractor being the most prominent example [6]. However, chaotic dynamics can appear also when the fractal invariant set in the phase space is not purely attracting, e.g. it may have stable and unstable directions (a saddle). Trajectories close to this set perform chaotic motion for an arbitrarily long (yet finite) time. This phenomenon appears in a variety of physical systems and is known as transient chaos [41,42].
Numerical investigations of transiently-chaotic systems are computationally difficult because most trajectories quickly escape the vicinity of the chaotic saddle (on which the chaotic dynamics is properly defined) [12,13,43]. More precisely, the escape time of a state x in a pre-selected region x ∈ Γ ⊂ Ω is defined as the first passage time of its trajectory to a pre-selected exit set Λ ⊂ Ω:

t_e(x) = min { t : F^t(x) ∈ Λ }.   (12)

Almost all trajectories start in Γ and eventually leave (x_t ∈ Λ), but they do so at different times t_e. This variability is quantified by the distribution of escape times, the probability that a random initial condition x leaves at a given time t_e:

P(t_e) = ∫_Γ U(x) δ_{t_e, t_e(x)} dx,   (13)

where δ_{t_e,t_e(x)} is the Kronecker delta and U(x) is the probability density assigned to each state in Γ, which is often constant, U(x) = 1/∫_Γ dx ≡ 1/|Γ|. Figure 2 shows the typical features of t_e(x): it depends strongly on the initial state x, and its distribution P(t_e) decays exponentially to zero (i.e., it has a constant escape rate κ).
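The definitions in Eqs. 12 and 13 can be made concrete with a minimal Python sketch of ours for an open tent map (the slope a = 3, the cap, and the sample size are illustrative assumptions; the paper's own maps are defined in Appendix VII A):

```python
import random

A = 3.0   # slope of the open tent map; a > 2 opens escape channels

def tent(x):
    return A * x if x < 0.5 else A * (1.0 - x)

def escape_time(x, t_max=60):
    # Eq. 12: first passage time out of Gamma = [0, 1] (capped at t_max)
    for t in range(1, t_max + 1):
        x = tent(x)
        if not 0.0 <= x <= 1.0:
            return t
    return t_max

random.seed(0)
M = 200_000
hist = {}
for _ in range(M):                 # uniform U(x) on Gamma, as in Eq. 13
    te = escape_time(random.random())
    hist[te] = hist.get(te, 0) + 1
# For this map one third of Gamma escapes at each step, so
# P(te) = (1/3) (2/3)^(te-1): exponential decay with rate kappa = ln(3/2).
```

The histogram makes the sampling problem explicit: the number of uniform samples surviving to time t_e shrinks by a constant factor per time step.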
[Figure 2 caption (fragment): ... (2) the Standard Map, both defined in Appendix VII A. P(t_e = t) was computed by uniformly drawing 10^6 states x ∈ Γ and measuring the relative number of times that t_e(x) = t_e.]

There are two main numerical techniques to study transiently chaotic systems. The first is to find one long-living state x with t_e(x) ≫ 1 and to compute an average over states of this trajectory [11,12]. For large t_e(x), the trajectory between times [t_e(x)/2 − t_s, t_e(x)/2 + t_s], where t_s ≪ t_e(x)/2, is close to a trajectory on the chaotic saddle. When the saddle is ergodic, an average over this long trajectory corresponds to an average over the natural invariant density and therefore an average over these
states characterizes invariant properties of the system. The second technique, on which we focus in this work, is to compute averages over an ensemble U(x) of initial conditions x that leave the system at time t_e(x) = t_e [11,42]. For small t_e the observations depend on the particular initial density U(x) (a point of interest in itself [42]), while for large t_e they characterize invariant properties of the system (as in the previous technique, the states F^{t_e/2}(x) with t_e → ∞ are independent samples of the natural invariant density). A typical analysis when sampling transiently chaotic systems is to measure how a quantity, W(x), changes with increasing escape time t_e. Numerically, this can be written as

W(t_e) = ∫_Γ W(x) U(x) δ_{t_e,t_e(x)} dx / ∫_Γ U(x) δ_{t_e,t_e(x)} dx.   (14)

Let us enumerate three examples of computational problems that can be interpreted as numerical estimations of an integral of the form of W(t_e) in Eq. 14. a. Compute the escape time distribution: Numerically, P(t_e) is often computed by drawing states from U(x) (e.g. a uniform density) and counting the relative number of states that exited at escape time t_e [11]. This corresponds to W(x) = 1, in which case Eq. 14 reduces to Eq. 13.
b. Compute generalized dimensions of the chaotic saddle: The generalized dimensions are an important property of the chaotic saddle and their calculation is often performed by box counting [5,11,44]. Essentially, the phase space is divided into equal, non-overlapping, and space-filling boxes i = 1, ..., B(ε) of linear size ε (intervals in 1 dimension, squares in 2, etc.) and the generalized dimension D_q of exponent q is proportional to log Σ_i µ_i^q, where µ_i is the fraction of points of the saddle that belong to box i [44]. Numerically, µ_i is estimated by first obtaining a set of points x_j on the saddle and then counting how many are in the particular box i. Such an estimate can be written as the expectation of an indicator function that tells whether a state on the saddle, F^{t_e/2}(x) for t_e(x) ≫ 1, is inside box i, W(x) = δ_{F^{t_e/2}(x) ∈ i}. This expectation can be written as a conditional expectation of W(x) over states that leave at time t_e,

µ_i(t_e) = ∫_Γ δ_{F^{t_e/2}(x) ∈ i} U(x) δ_{t_e,t_e(x)} dx / ∫_Γ U(x) δ_{t_e,t_e(x)} dx.   (15)

Computing this essentially requires computing integrals of the form of W(t_e) in Eq. 14. c. Compute the distribution of FTLE on the chaotic saddle: The distribution of FTLE, P(λ), is another important property of the chaotic saddle [11,44]. The FTLE λ = λ_t for a fixed t of an open system is computed for trajectories on the chaotic saddle. As in the previous problem, each of these trajectories can be obtained by generating a state x according to U(x) (e.g. uniformly distributed in Γ) that has a large escape time, t_e(x) ≫ 1, and computing λ_t(F^{t_e/2}(x)) using Eq. 6 for a fixed t. The distribution of FTLE is then computed from an ensemble of these high-escape-time states by constructing the bins of a histogram, I_λ = [λ, λ + ∆λ], and numerically computing the relative number of states in each bin. Formally, this equates to computing the expected number of states with a given escape time t_e whose FTLE is in a bin, and thus corresponds to a conditional expectation of the form of Eq. 15 with W(x) = W_λ(x) = δ_{λ_t(F^{t_e/2}(x)) ∈ I_λ}.
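A hypothetical end-to-end illustration of the box-counting procedure: for an open tent map with slope 3 the saddle is the middle-thirds Cantor set, so a D_0 estimate from midpoint iterates of long-lived trajectories should approach ln 2 / ln 3 ≈ 0.63. All parameters below (the map, survival threshold, box size, sample size) are our own choices, not the paper's:

```python
import math, random

A = 3.0  # open tent map, slope 3: the invariant saddle is the middle-thirds Cantor set

def tent(x):
    return A * x if x < 0.5 else A * (1.0 - x)

def iterates_until_escape(x, t_max):
    """Iterate until the point leaves [0, 1] or t_max is reached;
    return (capped escape time, iterates recorded while still inside)."""
    xs = [x]
    for t in range(1, t_max + 1):
        x = tent(x)
        if not 0.0 <= x <= 1.0:
            return t, xs
        xs.append(x)
    return t_max, xs

random.seed(2)
saddle_pts = []
for _ in range(1_000_000):
    te, xs = iterates_until_escape(random.random(), t_max=16)
    if te == 16:                  # a long-lived trajectory, te(x) >> 1
        saddle_pts.append(xs[8])  # midpoint iterate F^{te/2}(x), close to the saddle

eps = 3.0 ** -5                   # box size for the box counting
occupied = {math.floor(p / eps) for p in saddle_pts}
D0 = math.log(len(occupied)) / math.log(1.0 / eps)  # box-counting estimate of D_0
```

Note how wasteful this uniform approach is: out of a million initial conditions, only the exponentially small fraction surviving 16 steps contributes points to the box count.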
C. Summary: numerical problems in the study of rare events in chaotic systems

To provide a unified treatment of the numerical problems in transient chaos and in computing the FTLE of closed systems, we use a common notation whenever possible. It also indicates how the methods can be generalized to different problems. Firstly, there is a quantity that we call "observable" and denote by E(x): in strongly chaotic open systems E(x) = t_e(x), while in FTLE computations E(x) = t_o λ_{t_o}(x). Secondly, we use δ(E, E') to denote both the Kronecker delta (δ_{E,E'}) and the Dirac delta (δ(E − E')), for the discrete and continuous case respectively. This is needed because, even though in practice we will always consider E to be a discrete function due to the binning of the histograms, formally the observable E is either discrete (t_e for maps) or continuous (λ_t, or t_e for flows). Thirdly, the observable E is a function of the phase space of the system but typically depends also on an external parameter N that sets how large E can be. This parameter is important to us because, the higher it is, the rarer a state with the maximum or minimum possible E is. In this notation, the two general problems presented in the previous sections can be summarized as follows: • the analysis is conditioned on a projection of x onto a one-dimensional observable E(x).
• the probability distribution of E, P (E), decays exponentially with increasing E.
• the focus of the analysis is on rare states with respect to P (E): states with E in one of the tails of P (E).
• the rareness increases with N .
In transient-chaos problems we are interested in states with increasing N = t_max that correspond to the tail of P(E) = P(t_e). In computations of the FTLE in closed systems we are interested in states with increasing N = t_o and on the tails of P(E). These problems share two distinct computational tasks: finding rare states [11-13,16] and sampling rare states [17,18,45,46], which can be formalized as follows: • Find rare states: minimize or maximize E(x) over a constraining region Γ, x ∈ Γ ⊂ Ω, for increasing N.
• Sample rare states: compute the integral of a function W(x) conditioned on a particular value of E and for increasing N, over an ensemble of states x distributed according to U(x) in a constraining region of the phase space, x ∈ Γ ⊂ Ω:

W(E) = ∫_Γ W(x) U(x) δ(E, E(x)) dx / ∫_Γ U(x) δ(E, E(x)) dx.   (16)

The numerical challenges of these two problems are common: the states x are exponentially difficult to find with increasing N (or E), the function E(x) contains multiple local and global minima embedded in fractal-like structures, and the dimensionality D of the phase space is potentially high (see Figs. 1 and 2). The goal of this paper is to develop a systematic approach to tackle the two numerical problems and these three challenges.

A. Importance Sampling
As described in the previous section, the traditional methodology to compute an integral of the form of Eq. 16 is to draw m samples {x_i} distributed according to U(x) from the relevant region Γ and approximate the integral W(E) by the estimator

Ŵ(E) = Σ_i W(x_i) δ(E, E(x_i)) / Σ_i δ(E, E(x_i)),   (17)

where δ(E, E(x_i)) = 1 when E(x_i) ∈ [E, E + ∆E[ and zero otherwise (when E is discrete, ∆E = 1). The relative distance of the estimator to the exact value, ε(E), decreases with the number of samples that fall in the bin of E. For the estimator in Eq. 17, it is given by

ε(E) ∝ 1/√(m G(E)),   (18)

where G(E) is the density of states: the number of states per bin with an observable E. [47] Therefore, the number of samples m* required to achieve a given precision ε* for a given E* is m*(E*) ∝ 1/G(E*). The critical problem in sampling rare states is that, because G(E) decays exponentially with E, m*(E) increases exponentially with E.

Importance-sampling techniques aim to improve this scaling by drawing samples from a distribution π(x) ≠ U(x) on the phase space (e.g. non-uniformly in the phase space) [8]. Specifically, consider m independent samples {x_i} drawn from π(x), and the function π(x) to depend only on E, π(x) = π(E(x)) = π(E). Because the samples are not drawn from U(x), the estimator in Eq. 17 would be biased. Importance sampling uses an unbiased estimator for W(E) given by [8]

Ŵ_π(E) = Σ_i W(x_i) δ(E, E(x_i))/π(E(x_i)) / Σ_i δ(E, E(x_i))/π(E(x_i)),   (19)

which reduces to Eq. 17 when π(x) = U(x). The advantage of importance sampling is that the relative error of the estimator in Eq. 19 is given by

ε(E) ∝ 1/√(m(E)).   (20)

This is because, when sampling from π(x) = π(E(x)), the expected number of samples with a given E, m(E), is equal to

m(E) = m π(E) G(E).   (21)

Equation 20 implies that the function π(E) can be chosen to favour states x with observable E on the tails of G(E) and therefore improve the precision of the estimator on these tails. The standard deviations in Eqs. 18 and 20 were obtained assuming that the m samples are independent. In traditional methodologies such as uniform sampling this is the case; in the algorithms discussed below it is not. It is therefore necessary to modify Eq. 20 for the case where the samples {x_i} are drawn from π(x) and are also correlated. This modification is given by

ε(E) ∝ √(T(E)/m(E)),   (22)

where T(E) is the autocorrelation time [8], which increases with the correlation of the samples. Efficient importance-sampling techniques [7-9] have to address the following three steps:

1. choose a suitable π(x);
2. have a method to generate samples from π(x);
3. minimize the autocorrelation time T.
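The reweighting in Eq. 19 can be illustrated on a toy observable with an exponentially decaying density of states, chosen so that π(x) can be sampled exactly by construction (the observable E(x) = floor(-log2 x) and all parameters are our own toy choices, not from the paper):

```python
import math, random

# Toy setting on Gamma = [0, 1]: the observable E(x) = floor(-log2 x) is
# constant on the interval (2^-(E+1), 2^-E], whose length 2^-(E+1) plays the
# role of an exponentially decaying G(E).
random.seed(4)
E_max, m, target = 30, 62_000, 25   # P(E = 25) = 2^-26: far in the tail
num = 0.0
for _ in range(m):
    E = random.randrange(E_max + 1)                  # pi(E) = 1/(E_max + 1): flat in E
    x = random.uniform(2.0 ** -(E + 1), 2.0 ** -E)   # then uniform inside that interval
    w = (E_max + 1) * 2.0 ** -(E + 1)                # importance weight U(x)/pi(x)
    if E == target:
        num += w
P_est = num / m   # unbiased estimate of P(E = target), cf. Eq. 19
# Uniform sampling would need of order 2^26 samples to see even one such state.
```

The point of the toy is that the exact interval structure of E(x) is known here, so π(x) can be sampled directly; in a generic chaotic system that structure is exactly what is unavailable, which motivates the Metropolis-Hastings construction below.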
A defining point of our method is the choice of the Metropolis-Hastings algorithm to address point 2, differently from Refs. [16,20], which address similar problems through a different choice for point 2 (cloning trajectories). Below we first discuss point 2, then point 1, and finally point 3.

B. Metropolis-Hastings algorithm
The Metropolis-Hastings (MH) algorithm asymptotically generates states x according to π(x) using a Markovian, ergodic random walk in the sampling region Γ that fulfills detailed balance [8]. This random walk is initialized from a random state x ∈ Γ and evolves to a new state x' ∈ Γ with a transition probability P(x'|x) chosen such that asymptotically the states x are drawn according to π(x). In Metropolis-Hastings, P(x'|x) is written as

P(x'|x) = g(x'|x) a(x'|x),

where g(x'|x) is the (proposal) distribution used to generate new states and a(x'|x) is the (acceptance) probability used to select them. The random walk fulfills detailed balance because the acceptance probability is chosen as [8]

a(x'|x) = min ( 1, [π(x') g(x|x')] / [π(x) g(x'|x)] ).   (23)

The Metropolis-Hastings algorithm is not the only way to achieve this. Another popular alternative to sample from π(x) is population Monte Carlo, which, instead of a random walk, uses multiple stochastic trajectories that are cloned and destroyed. See e.g. Ref. [18] for an application to the FTLE problem described above.
Algorithmically, the MH algorithm is implemented as follows: choose a region Γ and a random initial condition x ∈ Γ, then evolve the random walk in time according to:

1. Propose a state x' drawn from g(x'|x);
2. Compute a(x'|x) by replacing x and x' in Eq. 23;
3. Generate a random number r in [0, 1]. If r < a(x'|x), set x = x';
4. Store x and go to 1.
The set of sub-steps 1-4 brings the random walk from its current state x to the next state and is called a Markov step. After a transient number of steps, during which the algorithm converges to the asymptotic distribution, the stored states x are (correlated) samples drawn from π(x) and can be used directly in Eq. 19.
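Steps 1-4 can be sketched for a concrete case: a random walk with a symmetric Gaussian proposal sampling π(x) ∝ exp(β t_e(x)) for an open tent map. This is the kind of naive local proposal whose limitations are analyzed later; the slope a = 3, β = 0.2, the cap, and the proposal width are illustrative assumptions of ours:

```python
import math, random

A, BETA, T_CAP = 3.0, 0.2, 30   # open tent map slope, bias strength, escape-time cap

def escape_time(x):
    # first passage time out of [0, 1], capped at T_CAP (Eq. 12)
    for t in range(1, T_CAP + 1):
        x = A * x if x < 0.5 else A * (1.0 - x)
        if not 0.0 <= x <= 1.0:
            return t
    return T_CAP

random.seed(5)
x = random.random()
E = escape_time(x)
counts = {}
for step in range(60_000):
    xp = x + random.gauss(0.0, 0.05)   # step 1: symmetric local proposal g(x'|x)
    if 0.0 <= xp <= 1.0:               # proposals outside Gamma count as rejections
        Ep = escape_time(xp)
        # steps 2-3: Eq. 23 with symmetric g and pi(x) ~ exp(BETA * te(x))
        if random.random() < math.exp(min(0.0, BETA * (Ep - E))):
            x, E = xp, Ep
    if step >= 2_000:                  # step 4: store (after a burn-in transient)
        counts[E] = counts.get(E, 0) + 1
# stationary samples follow P(te) ~ G(te) e^{BETA te}: successive-te ratios
# approach (2/3) e^{BETA}, larger than the unbiased ratio 2/3
```

Because the proposal is symmetric, g cancels in Eq. 23 and the acceptance depends only on the change of the observable, which is what makes this sketch so short.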
In the context of rare states, the canonical ensemble,

π(E) ∝ e^{−βE},   (24)

is useful because the number of sampled states m(E) in Eq. 21 becomes

m(E) ∝ m G(E) e^{−βE} = m e^{S(E) − βE},   (25)

where S(E) ≡ ln G(E). In particular, the maximum of m(E) is at the E* that solves β = dS/dE(E*). Therefore β tunes which value of the observable E is sampled the most. For example, the Lyapunov weighted dynamics in Ref. [18] uses this distribution (Eq. 15 of that reference, with α replaced by β). e. Flat-histogram ensemble Another distribution often used in the Metropolis-Hastings literature [48-55] is the flat-histogram (or multicanonical) ensemble, given by [53,54]

π(E) ∝ 1/G(E) for E ∈ [E_min, E_max],   (26)

for a given choice of E_min, E_max that defines the region of interest of the observable E. This is known as flat-histogram because replacing Eq. 26 in Eq. 21 leads to a constant average number of samples for each E,

m(E) ∝ m / (E_max − E_min).   (27)

Consequently, the dependence of the variance in Eq. 22 is only due to the autocorrelation time T(E), which implies that the computational cost to draw a state on the tail of G(E) is no longer driven by the exponential decrease of G(E) but by the computational cost to draw uncorrelated samples from π(x).
The main limitation of the flat-histogram is that it requires knowing G(E) in advance, which is very often unknown.
Several methods exist to iteratively construct an approximation of G(E). The best known, which we use here, is the Wang-Landau algorithm [54], which modifies the Metropolis-Hastings algorithm into a non-Markovian chain that asymptotically converges to a flat-histogram Metropolis-Hastings. The Wang-Landau algorithm starts with a flat approximation of G(E), G_WL(E) = 1/|E_max − E_min|, and, at step 4 of the MH algorithm (see Sec. III B), it multiplies G_WL(E(x)) (x is the current state of the random walk) by a constant f > 1. After a given number of steps, ln f is reduced by a factor of 2 and this procedure is repeated until a final f_min ≈ 1 is reached [54]. The value of f_min and how f is reduced dictate how close G_WL(E) will be to G(E) [54,56,57].
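A minimal Wang-Landau sketch of ours for the escape-time observable of an open tent map (with fixed-length ln f stages instead of a flatness criterion, and all parameters simplified): for this map G(t_e) ∝ (2/3)^{t_e}, so the learned ln G_WL should decrease by roughly ln(2/3) per unit of t_e.

```python
import math, random

A, T_MAX = 3.0, 12   # open tent map slope; te >= 12 is lumped into the last bin

def escape_time(x):
    for t in range(1, T_MAX + 1):
        x = A * x if x < 0.5 else A * (1.0 - x)
        if not 0.0 <= x <= 1.0:
            return t
    return T_MAX

random.seed(3)
lnG = [0.0] * (T_MAX + 1)   # running estimate of ln G_WL(te); index 0 unused
x = random.random()
E = escape_time(x)
ln_f = 1.0
while ln_f > 1e-4:          # halve ln f (f -> sqrt f) after each fixed-length stage
    for _ in range(4_000):
        xp = random.random()              # global uniform proposal over Gamma
        Ep = escape_time(xp)
        # flat-histogram acceptance: a = min(1, G_WL(E) / G_WL(E'))
        if random.random() < math.exp(min(0.0, lnG[E] - lnG[Ep])):
            x, E = xp, Ep
        lnG[E] += ln_f                    # Wang-Landau update of the visited bin
    ln_f /= 2.0
slope = (lnG[10] - lnG[2]) / 8.0   # should approach ln(2/3) ~ -0.405
```

Note that the global uniform proposal works here only because the escape-time range is modest; for larger t_max the tail regions become exponentially thin and the proposal itself must be adapted, which is the subject of Sec. IV.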

D. Characterisation of the efficiency
The relative error ε(E) depends on the value of N and on the particular value of E/N. To avoid discussing the dependency on E/N, the efficiency of the flat-histogram is often quantified in terms of the average round-trip [58-60], which is an upper bound for the number of Markov steps m (samples) required to obtain an uncorrelated sample from π(x) [60,61]. The round-trip, τ, is the average number of steps required for the algorithm to go from a state x with E(x) = E_min to a state x' with E(x') = E_max and return. [62] Ref. [45] shows how the autocorrelation time of a canonical ensemble is related to that of a flat-histogram, and therefore we will use here the round-trip time of a flat-histogram simulation to quantify the efficiency for both distributions. Uniform sampling has a round-trip that increases exponentially with N: on average it takes 1/G(E_max) samples to get one sample with E_max, and 1/G(E_max) increases exponentially with N.
Importance sampling Monte Carlo is widely used in statistical physics because the computational cost often scales polynomially with N [8,60], a dramatic improvement over uniform sampling. Under the hypotheses that (a) the acceptance rate does not vanish with increasing N and (b) the correlations between the different E(x) of the random walk decay fast enough, it can be shown that the round-trip time scales as τ ∼ N² (Eq. 28) [46,60]. For example, consider the problem of sampling states of an open chaotic system with different escape times t_e, from t_e = 1 up to t_e = t_e^max. In this case, E_min = 1 and E_max = t_e^max. Thus, under hypotheses (a) and (b) above, the round-trip time is expected to scale as τ ∼ (t_e^max)². There are known deviations from the scaling in Eq. 28 leading to a higher exponent, N^{2+z} with z > 0 [46,58-60], and there are two common situations where this happens: (1) the acceptance rate decreases drastically with N (hypothesis (a) is violated); (2) the autocorrelation increases drastically with N (hypothesis (b) is violated) [46,60]. Nevertheless, these deviations do not qualitatively change the argument: Monte Carlo with a flat histogram has the potential to generate rare states in a number of samples that grows polynomially with N, while uniform sampling requires a number that grows exponentially with N.
To achieve a Metropolis-Hastings (MH) algorithm that scales polynomially with N, the autocorrelation time needs to be low. A key ingredient for an efficient algorithm is a good proposal distribution g(x′|x) [63]. The ideal proposal distribution of a Metropolis-Hastings algorithm draws x′ independently of x according to π(x′), g(x′|x) = π(x′). This is because (i) the acceptance in Eq. 23 is always one and (ii) each step of the random walk generates an independent sample x, which implies that the error in Eq. 22 is minimal. The difficulty of sampling rare events in chaotic systems is that the distributions π(x) useful to sample them are difficult to sample from directly. For concreteness, consider the problem of sampling high escape times in the open tent map defined in Appendix VII A 3, Eq. 66, whose t_e(x) is represented in Fig. 12, and consider the canonical sampling distribution, π(x) ∝ exp(−β t_e(x)). In this example, π(x) ∝ exp(−β · 1) in the interval [1/a, 1 − 1/a], and so forth (∝ exp(−β t_e)) in the subsequent intervals. The number of intervals increases as 2^{t_e}. Therefore, sampling directly from π(x) would require enumerating every interval, sampling one of them at random according to a distribution that depends on β, and then sampling a uniform point within that interval. While this can be done in simple maps such as the open tent map, it is unfeasible in a general system where the dependency t_e(x) is unknown.
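For concreteness, the escape-time structure just described can be sketched numerically. We assume the common form f(x) = a·min(x, 1 − x) for the open tent map (the paper's exact definition is in its Appendix VII A 3), so this is an illustration rather than a reproduction:

```python
import numpy as np

def tent(x, a=3.0):
    """Open tent map: for a > 2, part of the unit interval is mapped outside
    [0, 1] and escapes (assumed form; see Appendix VII A 3 of the paper)."""
    return a * x if x < 0.5 else a * (1.0 - x)

def escape_time(x, a=3.0, t_cap=60):
    """Number of iterations until the trajectory leaves [0, 1]."""
    for t in range(1, t_cap + 1):
        x = tent(x, a)
        if not 0.0 <= x <= 1.0:
            return t
    return t_cap  # never escaped within the cap (measure-zero invariant set)

# Uniform sampling: for a = 3, a fraction 1/3 of the survivors escapes at each
# step, so P(t_e = t) = (1/3)(2/3)^(t-1) and long escape times are
# exponentially rare.
rng = np.random.default_rng(0)
ts = np.array([escape_time(x) for x in rng.random(100_000)])
counts = np.bincount(ts)
```

With a = 3, the interval (1/3, 2/3) has t_e = 1, its two preimages have t_e = 2, and so on, with 2^{t_e−1} intervals per escape time, which is why enumerating the intervals of π(x) quickly becomes unfeasible.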
One way to approach the problem is to take g(x′|x) to be the uniform distribution over Γ. One could imagine that changing the sampling distribution π(x) alone would then decrease the variance of the estimator, since it gives more weight to rarer states. However, this is not the case: changing the sampling distribution alone does not improve the scaling of the variance of the estimator. The reason is that changing the sampling distribution (e.g. using a canonical ensemble) while keeping a uniform proposal leads to an exponential increase of the autocorrelation time T(E) with E, see Appendix VII B, making this scheme no more efficient than traditional uniform sampling.
In summary, Metropolis-Hastings is an excellent candidate to address the numerical challenges found in the study of rare events in chaotic systems. Firstly, because it is grounded in strong mathematical results, such as the importance-sampling theorem and the asymptotic convergence of Markov processes. Secondly, because it is formulated with very few assumptions about the system, the observable of interest, or the dynamics, which gives enough freedom to adapt it to the specific aim (sampling or finding), observable, and system. Thirdly, because there seems to be no theoretical reason for the sampling cost to be exponential; Metropolis-Hastings is used to sample rare states in polynomial time in other problems of statistical physics. Finally, because the numerical problems found in chaotic systems can be rewritten as problems for which Metropolis-Hastings is suitable. On the other hand, the optimal proposal of Metropolis-Hastings is unfeasible in chaotic systems, and without any extra information about the system Metropolis-Hastings is only as efficient as traditional uniform sampling.

IV. PROPOSAL DISTRIBUTION
The problem we address in this section is: how can general properties of chaotic systems be incorporated into the proposal distribution in such a way that the Metropolis-Hastings algorithm becomes efficient? We first set an aim for the proposal distribution and then show how this aim can be achieved in the different problems involving deterministic chaotic systems.

A. Aim of the proposal distribution
The goal of the proposal distribution g(x′|x) we construct here is to bound the acceptance rate in Eq. 23 away from 0 and 1. Since a(x′|x) in Eq. 23 depends on x′, it is essential to look at its expectation over the proposal,

E[a(x′|x) | x] = ∫_Γ g(x′|x) a(x′|x) dx′. (30)

Our goal is to construct a proposal distribution such that [64]

E[a(x′|x) | x] = a, with 0 < a < 1 independent of N. (31)

The motivation to set this as the starting point (and cornerstone) of our method is that it avoids the two typical origins of high correlation times T. When the acceptance is low, T increases because the method remains stuck in the same state x for long times. A high acceptance typically indicates that x′ is too close to x (often E(x′) = E(x)), which implies that the simulation moves too slowly (in E and in Ω). The goal here is not that the acceptance is exactly a, but rather that it remains bounded away from the extremes and does not strongly depend on N.
In general it is non-trivial to construct a proposal distribution that guarantees a constant acceptance. Thus, the next step is to approximate Eq. 31 by a simpler condition. Let us first notice that π(x) only depends on x through E_x ≡ E(x), π(x) = π(E_x). Therefore, at the very least, the proposal g(x′|x) should guarantee that x′ is generated from x in such a way that π(E_{x′}) is neither too close (high acceptance) nor too far (low acceptance) from π(E_x). Quantitatively, this can be written as

E[π(E_{x′}) | x] = a π(E_x), (32)

where 0 < a < 1 is a constant. When the proposal is able to achieve a small variation of E, π(E_{x′}) can be expanded in a Taylor series around E_{x′} = E_x, which allows us to write [65]

E[π(E_{x′}) | x] ≈ π(E_x) (1 + (d log π(E)/dE)|_{E=E_x} E[E_{x′} − E_x | x]). (33)

Inserting Eq. 33 in Eq. 32, an average constant acceptance is thus achieved when

E[E_{x′} − E_x | x] = c / (d log π(E)/dE)|_{E=E_x}, with c ≡ a − 1. (34)

This equation, the main result of this section, is a condition that an efficient Metropolis-Hastings algorithm imposes on the proposal distribution in terms of the average difference in the observable E. The condition is non-trivial because it depends on the particular π, E, F, and x. For the sampling distributions discussed in the previous section, we obtain:

f. Canonical ensemble: when π(x) = e^{−β E_x}, the condition in Eq. 34 reads

E[E_{x′} − E_x | x] = −c/β. (35)

That is, the higher the β, the closer the proposed E_{x′} has to be to E_x.

g. Flat-histogram: when π(x) ∝ 1/G(E_x), the condition in Eq. 34 reads

E[E_{x′} − E_x | x] = −c / (d log G(E)/dE)|_{E=E_x}. (36)

When E_x is close to the maximum of G(E), the derivative of log G approaches 0 and E_{x′} can be arbitrarily distant from E_x. As E_x deviates from the maximum of G(E), smaller changes are necessary to achieve a constant acceptance. Note that Eq. 34 is valid for the Metropolis-Hastings algorithm in general and should be of interest also in other contexts (e.g., an arbitrarily large number of spins can be flipped close to the maximum of the density of states).
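For reference, a generic Metropolis-Hastings step with the acceptance of Eq. 23 can be sketched as follows; the function names and the toy target below are ours, not those of the implementation in Ref. [1]:

```python
import numpy as np

def mh_step(x, E, log_pi, propose, log_g, rng):
    """One Metropolis-Hastings step: accept x' with probability
    a(x'|x) = min(1, pi(E(x')) g(x|x') / (pi(E(x)) g(x'|x))), Eq. 23.
    log_pi(E) is the log of the (unnormalised) sampling distribution;
    propose(x, rng) draws x'; log_g(y, x) returns log g(y|x)."""
    xp = propose(x, rng)
    log_a = (log_pi(E(xp)) - log_pi(E(x))) + (log_g(x, xp) - log_g(xp, x))
    return (xp, True) if np.log(rng.random()) < log_a else (x, False)

# Toy usage: canonical ensemble pi(E) = exp(-beta E) with E(x) = x^2 and a
# symmetric Gaussian proposal, so the g-ratio cancels (log_g returns 0);
# the stationary distribution is then a unit-variance Gaussian for beta = 1/2.
rng = np.random.default_rng(1)
beta = 0.5
x, chain = 0.0, []
for _ in range(50_000):
    x, _ = mh_step(x, lambda y: y * y, lambda e: -beta * e,
                   lambda y, r: y + r.normal(0.0, 1.0), lambda a, b: 0.0, rng)
    chain.append(x)
chain = np.array(chain)
```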

B. Propose correlated trajectories
The proposal distribution must correlate the trajectory starting at x′ with the trajectory starting at x in such a way that Eq. 34 holds. Fulfilling this requirement demands control over E[E_{x′} − E_x | x], which in turn demands a procedure to correlate the state x′ with x. The aim of this section is to introduce a quantification of the correlation between two trajectories of finite time t_o that can be related to E[E_{x′} − E_x | x], and to present two different proposal distributions for which this correlation is controlled.
The observables E introduced in Sec. II C all depend not only on x but on the full trajectory of length t_o starting at x [66]. One natural way to quantify the similarity of two trajectories is the time the two trajectories remain within a distance Δ much smaller than a characteristic length of Γ: when x′ → x, this time approaches t_o because the trajectories are the same; when x′ is far from x, t′(x, x′) = 0. However, there is another possibility for two trajectories starting at x and x′ to be similar: when the two trajectories are similar apart from a shift in time, see right panel of Fig. 3. One way to include both cases in our measure of similarity is to define t′(x, x′) as the maximum, over all shifts t_shift, of the number of steps i for which |F^i(x′) − F^{i+t_shift}(x)| ≤ Δ. (37) For t_shift = 0, this recovers the case of two trajectories starting close to each other; for t_shift ≠ 0, it includes situations where x′ starts close to F^{t_shift}(x). This definition is motivated by the concept of symbolic sequences [5]: the similarity of the trajectory starting from x′ with the one starting from x can be quantified by the number of symbols that both trajectories share, which corresponds to t′(x, x′) in Eq. 37. The definition in Eq. 37 avoids the need for a phase-space partition but, for the purposes of the argument below, the two trajectories share a sequence of t′ states that are within Δ of each other.
The average correlation between the current state and one drawn from the proposal distribution is here defined by

t′(x) ≡ E[t′(x, x′) | x] = ∫_Γ g(x′|x) t′(x, x′) dx′. (38)

Notice that this quantity does not depend on the particular sampling distribution or algorithm; it is a function only of the proposal distribution and of the state x. The goal of the next sub-sections is to construct proposal distributions that guarantee a given average correlation t′(x). We will show, for example, that we can enforce an average correlation t′(x) by using, as proposal distribution, a normal distribution centered around x with a specific standard deviation.
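The t_shift = 0 case of this similarity measure can be sketched as follows; the map, Δ, and the names are illustrative:

```python
import numpy as np

def tent(x, a=2.0):
    """Closed (full) tent map on [0, 1]; its Lyapunov exponent is ln 2."""
    return a * x if x < 0.5 else a * (1.0 - x)

def t_corr(x, xp, t_o, delta=0.05, f=tent):
    """Number of initial steps the two trajectories stay within delta
    (the t_shift = 0 case of Eq. 37)."""
    t = 0
    while t < t_o and abs(x - xp) < delta:
        x, xp, t = f(x), f(xp), t + 1
    return t

# A perturbation delta_0 = Delta exp(-lambda t') should keep the trajectories
# within Delta for about t' steps; here t' = 10 and lambda = ln 2, so
# delta_0 = 0.05 * 2**-10.
t_obs = t_corr(0.2, 0.2 + 0.05 * 2.0 ** -10, t_o=30)
```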

Shift proposal
One proposal that guarantees that trajectories are correlated by t′ is the shift proposal, originally introduced in Ref. [14] in the context of sampling paths of chemical reactions and also used in Ref. [22]. It consists of proposing a state x′ that is a forward or backward iteration of x, x′ = F^{±t_shift}(x), where t_shift is a free parameter. The relation between t_shift = t_shift(x) and t′ = t′(x) is that a shift of ±t_shift guarantees that t_o − t_shift elements of the original trajectory are preserved. Therefore, this proposal guarantees that t′ elements are preserved when |t_shift| = t_o − t′, see right panel of Fig. 3. Because detailed balance has to be enforceable, the proposal must contain both backward and forward shifts. A proposal that automatically fulfils detailed balance is one in which the backward and forward shifts are equally likely. Given the target average correlation t′(x), this proposal can be implemented as follows: generate a random number r ∈ [0, 1]; if r < 0.5, set x′ = F^{t_shift}(x), otherwise set x′ = F^{−t_shift}(x), with t_shift = t_o − t′(x). This proposal unfortunately has some disadvantages: i) a priori there is no guarantee that F^{t_shift}(x) ∈ Γ; it is applicable when Γ = Ω, which e.g. is not the case in open systems; ii) it requires the map to be invertible; iii) it can only propose states that are forward or backward iterations of x; consequently, for the random walk to be ergodic in the phase space, the map itself must be ergodic; iv) it diffuses without drift along the trajectory passing through x, by shifting the starting point forward or backward, and will therefore always sample fewer states than a time average of a trajectory starting at x. On the other hand, the main advantage of this proposal is that it performs non-local jumps in the phase space: it allows jumping from one region of the phase space to another while still keeping x′ correlated with x.
As shown below, in combination with other proposals, the shift proposal is useful to reduce correlations stemming from local jumps.
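A sketch of the shift proposal, using the Arnold cat map as a stand-in invertible chaotic map (the choice of map and all names are ours):

```python
import numpy as np

def cat(v):
    """Arnold cat map on the torus, an invertible chaotic map."""
    x, y = v
    return np.array([(2 * x + y) % 1.0, (x + y) % 1.0])

def cat_inv(v):
    """Inverse of the cat map (inverse matrix [[1, -1], [-1, 2]], mod 1)."""
    x, y = v
    return np.array([(x - y) % 1.0, (-x + 2 * y) % 1.0])

def shift_proposal(x, t_shift, f, f_inv, rng):
    """Shift proposal: x' = F^{+t_shift}(x) or F^{-t_shift}(x) with
    probability 1/2 each, so that forward and backward moves are equally
    likely and detailed balance holds for this part of the proposal."""
    step = f if rng.random() < 0.5 else f_inv
    for _ in range(t_shift):
        x = step(x)
    return x

rng = np.random.default_rng(2)
x0 = np.array([0.2, 0.3])
xp = shift_proposal(x0, 3, cat, cat_inv, rng)
```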

Neighbourhood proposal
Another strategy to construct a proposal for which, on average, the states are correlated by t′(x) is to perturb x by a finite amount δ, x′ = x + δ, characterised by a direction δ̂ and a norm δ, δ ≡ δ̂ δ. A common case is when the probability distribution separates into two independent terms [8], g(x′|x) = P(δ̂|x) P(δ|x), with P(δ̂|x) uniformly distributed over the D directions and P(δ|x) with zero mean (i.e. an isotropic proposal). Here we restrict the analysis to this situation, and we also assume that P(δ|x) is characterised by a well-defined scale, e.g. it is a half-normal distribution [67] with scale δ_x(x):

P(δ|x) ∝ exp(−δ² / (2 δ_x(x)²)), δ ≥ 0. (41)

This choice makes the ratio g(x|x′)/g(x′|x) in Eq. 23 equal to

g(x|x′)/g(x′|x) = (δ_x(x)/δ_x(x′)) exp( (δ²/2) (1/δ_x(x)² − 1/δ_x(x′)²) ). (42)

The main motivation for this choice is that the proposal distribution is described by a single function, δ_x(x), that quantifies the distance |x′ − x|. The goal is now to relate δ_x(x) with t′(x). Let us start by describing two important limits: in the limit δ_x(x) → 0, the states are the same and therefore lim_{δ_x(x)→0} t′(x) = t_o; in the limit δ_x(x) → |Γ|, the proposal is approximately equivalent to drawing x′ uniformly from Γ, x′ is independent of x, and t′(x) = 0. To preserve a correlation of t′(x), δ_x(x) must be such that the two trajectories starting at x and x′ remain close up to a time t′(x), see left panel of Fig. 3. Because the system is chaotic, for small δ_x(x) two trajectories diverge exponentially in time according to Eq. 4 and, in particular, their maximal distance is given by Eq. 7. Therefore, to guarantee that the two trajectories are separated by at most Δ after a time t′(x), δ_x(x) must be given by

δ_x(x) = Δ exp(−λ t′(x)). (43)

This equation relates the parameter of the proposal distribution, δ_x(x), with the average correlation t′(x) of the two states x and x′. The neighbourhood proposal derived above is closely related to a proposal described in Ref. [22] as "precision shooting".
Precision shooting constructs a trajectory {x′_i} with x′_0 ≡ x′ = x + δ̂δ (where δ is a free parameter) that, within the numerical precision of a computer, is indistinguishable (in the system considered) from a trajectory starting at x with δ small. The trajectory {x′_i} shadows a true trajectory starting at x′, in the same spirit as the algorithm used in Ref. [68] to construct a pseudo-trajectory. Thus, precision shooting can be interpreted as the neighbourhood proposal, Eq. 43, with t′ a free parameter (related to δ via Eq. 43), that invokes the shadowing theorem to simplify the construction of x′. Ref. [22] discusses how the acceptance rate depends on δ, suggesting that the acceptance rate increases with decreasing δ (Fig. 9 of that reference). In light of the discussion in Sec. IV B, this result is interpreted as follows: as δ decreases, x′ becomes more correlated with x (since t′ is related to δ by Eq. 43), and therefore the acceptance is expected to increase, as indicated in Fig. 9 of Ref. [22]. This discussion is nevertheless insufficient for our purposes because it does not allow deriving the δ (or t′) that fulfils the condition in Eq. 34. The crucial advantage of Eq. 43 is that it relates E[E(x′) − E(x)|x] (in Eq. 34) with |x′ − x| (in Eq. 43) through t′ (in Eq. 37). This is the goal of the next section.
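A one-dimensional sketch of this neighbourhood proposal; Δ, λ, and the names are illustrative:

```python
import numpy as np

def neighbourhood_proposal(x, t_corr, lam, rng, delta_cap=0.05):
    """Sketch of the neighbourhood proposal (Eqs. 41 and 43) in 1D:
    x' = x +/- |d|, with |d| half-normally distributed with scale
        delta_x = Delta * exp(-lam * t_corr),
    so that the two trajectories stay within Delta for about t_corr steps.
    delta_cap plays the role of Delta."""
    delta_x = delta_cap * np.exp(-lam * t_corr)
    d = abs(rng.normal(0.0, delta_x))            # half-normal norm, Eq. 41
    sign = 1.0 if rng.random() < 0.5 else -1.0   # isotropic direction in 1D
    return x + sign * d

rng = np.random.default_rng(5)
ds = np.array([abs(neighbourhood_proposal(0.5, 5, np.log(2.0), rng) - 0.5)
               for _ in range(20_000)])
```

Because δ_x depends on the current state, the proposal is not symmetric, and the ratio g(x|x′)/g(x′|x) of Eq. 42 must be kept in the acceptance of Eq. 23.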

C. Guarantee local proposals
Now that we have derived proposal distributions that enforce an average correlation t′(x), the next (and final) step is to obtain a relationship between t′(x) and E[E_{x′} − E_x | x]. Because the computation of t′(x) depends on the particular observable E, a different derivation is presented for each of the two observables, t_e and λ_{t_o}. Given the limitations of the shift proposal described above, the argument below is limited to neighbourhood proposals (an equivalent argument can be made for the shift proposal).

FTLE in closed systems
As introduced in Sec. II A, the observable in this case is E(x) = t_o λ_{t_o}(x), where t_o is the finite time. The aim is thus to write

E_x = t_o λ_{t_o}(x) = t′ λ_{t′}(x) + (t_o − t′) λ_{t_o−t′}(x_{t′}), (44)

where x_t ≡ F^t(x). Likewise, for the trajectory starting from x′,

E_{x′} = t_o λ_{t_o}(x′) = t′ λ_{t′}(x′) + (t_o − t′) λ_{t_o−t′}(x′_{t′}). (45)

Because x′ is proposed according to Eq. 43, by construction the first t′ states of the trajectory starting at x′ are close (within Δ) to the states of the trajectory starting at x. Therefore, we can approximate the respective FTLEs up to time t′ as equal,

E[λ_{t′}(x′) | x] = λ_{t′}(x). (46)

Subtracting Eq. 44 from Eq. 45 and using Eq. 46 gives

E[E_{x′} − E_x | x] = (t_o − t′) ( E[λ_{t_o−t′}(x′_{t′}) | x] − λ_{t_o−t′}(x_{t′}) ). (47)

The left side of this equation is the same as in Eq. 34, and thus the aim now is to write the right side as a function of properties of the system. Let us focus first on the calculation of E[λ_{t_o−t′}(x′_{t′}) | x]. By construction, x′ is generated such that |x′_{t′} − x_{t′}| ≈ Δ. Because the system is chaotic, one can approximate that x′_{t′} is sufficiently separated from x_{t′} for λ_{t_o−t′}(x′_{t′}) to be independent of x. Under this approximation, x′_{t′} is essentially a random state of the phase space, and thus λ_{t_o−t′}(x′_{t′}) is a draw from the distribution of FTLEs at time t_o − t′.
Denoting the mean of this distribution by λ_{L,t_o−t′}, we get

E[λ_{t_o−t′}(x′_{t′}) | x] = λ_{L,t_o−t′}. (48)

Replacing Eq. 48 in Eq. 47 gives

E[E_{x′} − E_x | x] = (t_o − t′) ( λ_{L,t_o−t′} − λ_{t_o−t′}(x_{t′}) ). (49)

This equation relates the expected change in the observable with properties of the system (λ_{L,t_o−t′}), of the trajectory (λ_{t_o−t′}(x_{t′})), and with t′(x); it can thus be used in the condition we obtained earlier, Eq. 34. Replacing the left side of Eq. 34 by the expectation in Eq. 49 and solving for t′(x) gives

t′(x) = t_o − c / ( (λ_{L,t_o−t′} − λ_{t_o−t′}(x_{t′})) (d log π(E)/dE)|_{E=E_x} ). (50)
We can further simplify this relation with two approximations: a) for large t_o − t′, the mean FTLE at time t_o − t′ is approximately the Lyapunov exponent of the system,

λ_{L,t_o−t′} ≈ λ_L; (51)

b) because trajectories of chaotic systems are short-correlated in time, the FTLE of the trajectory x up to t′ is approximately equal to the FTLE of the trajectory starting at x_{t′},

λ_{t_o−t′}(x_{t′}) ≈ λ_{t_o}(x). (52)

Using Eq. 51 and Eq. 52, t′(x) in Eq. 50 can be written as

t′(x) = t_o − c / ( (λ_L − λ_{t_o}(x)) (d log π(E)/dE)|_{E=E_x} ). (53)
To enforce that t′ ∈ [0, t_o], we use

t′(x) = max( 0, min( t_o, t_o − c / ( (λ_L − λ_{t_o}(x)) (d log π(E)/dE)|_{E=E_x} ) ) ), (54)

which is the main result of this section. This equation provides an expression for t′(x) that can be inserted in the parameter of the proposal distribution, Eq. 43, and that, under the approximations used above, achieves a constant acceptance rate [70].
The derivation of Eq. 54 can be generalised to other observables which, like λ_{t_o}, can be written as an average over the trajectory: consider

e_{t_o}(x) ≡ (1/t_o) Σ_{i=0}^{t_o−1} f(x_i), E_{t_o}(x) = t_o e_{t_o}(x), (55)

where f(x_i) is an arbitrary function of the phase space (choosing f as the logarithm of the derivative of the map recovers e_{t_o}(x) = λ_{t_o}(x), E_{t_o}(x) = λ_{t_o}(x) t_o). Replacing this quantity in the derivation of Eq. 54 mutatis mutandis, and without using the approximation in Eq. 52, one obtains

t′(x) = max( 0, min( t_o, t_o − c t_o / ( (E_L − E_{t_o}(x)) (d log π(E)/dE)|_{E=E_{t_o}(x)} ) ) ), (56)

where E_L is approximately the mean of the distribution P(E_{t_o}) (using the approximation in Eq. 48). This generalises Eq. 54 to an arbitrary average over trajectories of size t_o, e_{t_o}(x), and should be useful to sample rare states with respect to observables corresponding to expected values over trajectories.
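The resulting recipe can be sketched as follows, assuming Eq. 54 has the form t′ = t_o − c/((λ_L − λ_{t_o}(x)) d log π/dE), clipped to [0, t_o], with c the order-one constant of the acceptance condition (Eq. 34); all names and the clipping convention are ours:

```python
import numpy as np

def t_corr_ftle(lam_x, lam_L, dlogpi_dE, t_o, c=1.0):
    """Correlation time for FTLE sampling (sketch of Eq. 54).
    lam_x = lambda_{t_o}(x); dlogpi_dE = (d log pi / dE) at E_x."""
    denom = (lam_L - lam_x) * dlogpi_dE
    if denom == 0.0:
        return 0.0  # uniform-sampling limit: widest possible proposal
    return min(max(t_o - c / denom, 0.0), t_o)

def proposal_scale(t_corr, lam_L, delta=0.05):
    """Eq. 43: delta_x = Delta * exp(-lambda * t')."""
    return delta * np.exp(-lam_L * t_corr)
```

In the canonical ensemble d log π/dE = −β, so a state in the high-λ tail (lam_x > lam_L) yields a finite t′ < t_o and hence a small but non-vanishing proposal scale.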

Escape time in strongly chaotic open systems
As introduced before, in strongly chaotic open systems E(x) = t_e(x) and P(t_e) ∼ exp(−κ t_e). The aim here is to compute the t′(x) that fulfils Eq. 34, taking into account that x is given and x′ = x + δ̂δ with the scale δ_x(x) given by Eq. 43. In this case the trajectory's length is not constant; it is given by the escape time t_e(x).
A trajectory starting at x′ proposed according to Eq. 43 fulfils |x′_t − x_t| ≤ Δ up to t = t′. Therefore, up to t′, the two trajectories are indistinguishable. Under this assumption, and from the definition of t′ in Eq. 37, t_e(x′) = t′(x) + t_e(x′_{t′}) and therefore

E[t_e(x′) | x] = t′(x) + E[t_e(x′_{t′}) | x], (57)

where t_e(x′_{t′}) is the escape time of the state x′ iterated t′ times and E[t_e(x′_{t′})|x] = ∫_Ω g(x′|x) t_e(x′_{t′}) dx′ is the expectation of t_e(x′_{t′}) over a neighbourhood of x of size δ_x(x) (given by Eq. 43). The idea behind this equation is represented in Fig. 4. The proposal density g(x′|x) is such that, at t′, the two trajectories are separated on average by Δ. After t′, the trajectory starting at x′_{t′} is approximately independent of x and thus t_e(x′) = t′ + t_e(x′_{t′}). Moreover, proposing according to Eq. 43 is equivalent to "zooming" into the landscape of t_e(x) around x with a scale corresponding to the t′-th iteration of the construction of the landscape. Given the self-similarity of the landscape (see e.g. Fig. 13), under this zoom the landscape of t_e(x′_{t′}) is equal to the landscape of t_e(x), and therefore E[t_e(x′_{t′})|x] should be a constant independent of t_e. Furthermore, because x′_{t′} is approximately independent of x, E[t_e(x′_{t′})|x] is just the average escape time of an independent state x′_{t′}, which is the mean of P(t_e) and is given approximately by

E[t_e(x′_{t′}) | x] ≈ 1/κ. (58)

It is this result that incorporates the self-similarity of the escape-time function: the value of t′ chooses the particular zoom of the landscape (Fig. 4), and this equation assumes that, as long as the zoom is proportional to exp(λ t′), the average escape time of the zoomed region is still 1/κ. Replacing Eq. 58 in Eq. 57 gives

E[t_e(x′) | x] = t′(x) + 1/κ, (59)

and subtracting t_e(x) on both sides of Eq. 59 gives

E[t_e(x′) − t_e(x) | x] = t′(x) + 1/κ − t_e(x). (60)

The left side of this equation is the left side of the condition for a constant acceptance rate, Eq. 34.
Equating the right sides of Eq. 60 and Eq. 34 and solving for t′(x) gives

t′(x) = t_e(x) − 1/κ + c / (d log π(E)/dE)|_{E=t_e(x)}, (61)

which is analogous to Eq. 54 and is the central result of this section. Together with Eq. 43, it defines the proposal distribution we set out to construct.
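The recipe for open systems (Eq. 61 combined with Eq. 43) can be sketched as follows, assuming Eq. 61 reads t′ = t_e(x) − 1/κ + c/(d log π/dE), clipped below at 0; c again denotes the order-one constant of the acceptance condition (Eq. 34), and all names are ours:

```python
import numpy as np

def escape_proposal_scale(t_e_x, kappa, dlogpi_dE, lam_L, delta=0.05, c=1.0):
    """Correlation time and proposal scale for escape-time sampling.
    Returns (t', delta_x) with delta_x = Delta * exp(-lambda_L * t')."""
    if dlogpi_dE == 0.0:
        t = 0.0  # uniform-sampling limit: widest possible proposal
    else:
        t = max(t_e_x - 1.0 / kappa + c / dlogpi_dE, 0.0)
    return t, delta * np.exp(-lam_L * t)
```

For the flat histogram in t_e with P(t_e) ∝ e^{−κ t_e}, d log π/dE = +κ and (with c = 1) t′ = t_e(x): the proposed trajectory shadows the current one for its whole escape time, which is the choice tested in Sec. V B.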

D. Summary: how to propose
The argument in this section can be summarised as follows: • Metropolis-Hastings requires a proposal that guarantees a specific variation of E, which we approximate by Eq. 34.
• the proposal with a scale given by Eq. 43 guarantees that, on average, the trajectory starting at x′ stays close to the trajectory starting at x up to t′(x).
• the variation of E is related to t′(x) via Eqs. 49 and 60.
These three steps led to analytical formulas for t′(x), Eqs. 61 and 54, that make the proposal satisfy Eq. 34. Together with Eq. 43, they define the proposal distributions required for a constant average acceptance. Three points deserve to be noted. The first is that the formulas we obtained for t′(x) are also valid for the problem of finding rare states discussed e.g. in Refs. [12,13,16,17,71]. Specifically, these formulas were derived to guarantee Eq. 34, which dictates how different E[E(x′)|x] has to be from E(x) to guarantee a constant acceptance. This condition is stronger than the condition required for an algorithm to find minima or maxima of E, which requires only proposing states x′ such that E[E(x′)|x] > E(x) (or vice-versa for minimising E). This is independent of the particular minimisation algorithm (e.g. simulated annealing, steepest descent, stagger-and-step in open systems) because it only concerns which new state x′ should be tried given the current state x. The second point is that Eqs. 61 and 54 reduce the proposal to a uniform distribution when the sampling distribution is uniform: d log π(E)/dE → 0 implies δ_x(x) → ∞, i.e. a proposal as wide as Γ. The third point is that different proposals in the literature (precision shooting [22,46], the exponential proposal distribution [12,17], and the one in Ref. [45]) can be obtained from the proposals derived in this section through different approximations, as shown in Appendix VII C. Overall, this section described a framework to incorporate information about chaotic systems (the self-similar properties of the landscape, the exponential divergence of trajectories, and the exponential decay of correlations) into the construction of an efficient Metropolis-Hastings algorithm to sample them. A simplified description of the algorithm is given in Appendix VII D and an open-source implementation in Ref. [1].
The next section is devoted to testing the assumptions used here in each of the problems, escape time and FTLE, and to confirming the practical usefulness of the framework.

V. NUMERICAL TESTS
The previous section concluded with a set of formulas (the proposal in Eq. 43 combined with the formulas for t′(x), Eq. 54 or Eq. 61) for proposing states x′ that are expected to guarantee the desired acceptance rate (bounded away from 0 and 1). In this section we test some of the approximations made in the derivation of t′(x) (in a simple system) and we analyse the efficiency of the algorithm (confirming the polynomial scaling) in the computation of the FTLE in closed systems, introduced in Sec. II A, and of the escape time in open systems, introduced in Sec. II B. The approximations are tested in the skewed tent map and the open skewed tent map, and the efficiency is tested in numerous maps (chain of coupled Hénon maps, standard map, logistic map). All these systems are introduced in detail in Appendix VII A.

A. Finite-time Lyapunov exponent
The first approximation made in the derivation of Eq. 54 is that, when δ_x ≡ |x′ − x| is drawn from a half-normal distribution with scale parameter δ_x(x) given by Eq. 43, F^{t′}(x′) is sufficiently close to F^{t′}(x) that E[λ_{t′}(x′)|x] = λ_{t′}(x), Eq. 46, holds. We test this numerically by randomly drawing 10^5 states x_i in the tent map with a = 3 (see Appendix VII A) and, for each, proposing a state x′_i = x_i + δ(x) according to Eq. 41 with δ_x(x) given by Eq. 43. From the 10^5 pairs of states (x_i, x′_i), we estimate the difference of the observables, E′ − E. Our expectation is that, for a fixed t′ = t, E′ − E should be much smaller than E (the relevant scale in Eq. 47). We numerically obtain that, within the 99% quantile, E′ − E ≈ 1 independently of E and t. This value is much smaller than E ∈ [0.6t, 1.1t], especially since we are interested in large t. This strongly supports the approximation made in Eq. 46.
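This test can be sketched as follows, using a skewed tent map with peak at w = 1/3 as a stand-in for the map of Appendix VII A (for piecewise-linear maps with full branches the invariant density is uniform, so λ_L is the measure-weighted average of the log-slopes); the parameters are illustrative:

```python
import numpy as np

W = 1.0 / 3.0                                   # peak of the skewed tent map
LOG_S1, LOG_S2 = np.log(1.0 / W), np.log(1.0 / (1.0 - W))  # log-slopes: ln 3, ln 1.5
LAM_L = W * LOG_S1 + (1.0 - W) * LOG_S2         # Lyapunov exponent

def f(x):
    return x / W if x < W else (1.0 - x) / (1.0 - W)

def E_obs(x, t):
    """E(x) = t * lambda_t(x) = sum of log|f'(x_i)| along the trajectory."""
    s = 0.0
    for _ in range(t):
        s += LOG_S1 if x < W else LOG_S2
        x = f(x)
    return s

t, delta_cap = 16, 0.05
scale = delta_cap * np.exp(-LAM_L * t)          # Eq. 43 with t' = t
rng = np.random.default_rng(7)
dE = np.array([abs(E_obs(x + abs(rng.normal(0.0, scale)), t) - E_obs(x, t))
               for x in rng.random(3000) * 0.99])
```

The observed differences E′ − E are of order one while E ≈ λ_L t ≈ 10, consistent with the approximation of Eq. 46.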
The second approximation made in the derivation of Eq. 54 is that there is no dependence between λ_{t′}(x) and λ_{t_o−t′}(F^{t′}(x)), Eq. 48; in other words, that the FTLE of the trajectory starting at F^{t′}(x) and ending at t_o is indistinguishable from one drawn from the distribution of FTLEs at finite time t_o − t′, P(λ_{t_o−t′}).

FIG. 5.
The finite-time Lyapunov exponent of the first half of the trajectory is independent of that of the second half. The graph was obtained by sampling 10^5 random initial conditions x_i and computing (λ_t(x_i), λ_t(F^t(x_i))) = (first half, second half) of a 2t = 16-step trajectory. The y axis represents the mean (full black) ± 2 standard deviations (full blue) of λ_t(F^t(x_i)) conditioned on a given λ_t. The dashed lines represent the same mean and standard deviation over all points (without conditioning). When λ_t(F^t(x_i)) is independent of λ_t(x_i), the dashed and full lines coincide within fluctuations, as observed. A Kolmogorov-Smirnov test comparing the unconditioned and conditioned distributions gives p-values higher than 0.001 (the hypothesis that the two halves are independent is not rejected).
This approximation was numerically tested by drawing points x_i, computing the pairs (λ_t(x_i), λ_t(F^t(x_i))) (i.e. t_o = 2t), and testing whether the conditional probability of λ_t(F^t(x_i)) given λ_t(x_i) equals its unconditional probability. The results in Fig. 5 confirm the equality of these probabilities.
We finally test whether a proposal using t′(x) given by Eq. 54 guarantees a constant acceptance rate, the original motivation for our calculation. The test consisted in sampling 10^6 states according to the following procedure: 1) uniformly draw a state x_i and compute λ_i ≡ λ_{t_o}(x_i); 2) generate a state x′_i according to the proposal distribution Eq. 41 with δ_x(x) given by Eq. 43 and t′ given by Eq. 54 (δ_0 = 1), and compute λ′_i ≡ λ_{t_o}(x′_i); 3) store G_i ≡ g(x|x′)/g(x′|x) computed from Eq. 42, where δ_x is given by Eq. 43 and |x′ − x| by the stored δ_i = |x′_i − x_i|. The ratio r_i of the target distribution is given by the canonical ensemble or, for the flat histogram, by Eq. 64. The numerically estimated acceptance, A(λ_{t_o}) ≡ ⟨min(1, r_i × G_i)⟩, is shown in Fig. 6. It is not strictly independent of λ_{t_o}, the outcome expected from the assumption of constant acceptance used in our derivations, but it is bounded away from 0 and 1 for increasing N = t_o, the original requirement for an efficient simulation set in Sec. IV A. In the canonical ensemble there is a linear dependency of A on λ_{t_o}, and in the flat-histogram ensemble the acceptance is about 0.8 at the maximum of P(λ_{t_o}) and decays to about 0.1 in the tails.

Efficiency of the flat-histogram ensemble
The success in achieving a bounded acceptance independent of N, as confirmed in the previous section for the tent map, does not guarantee the efficiency of the method (see Sec. III D). Here we test the efficiency of flat-histogram computations of P(λ_t) in the tent map that use the neighbourhood proposal, the shift proposal, and both (mixed proposal). The results shown in Fig. 7 suggest that only when both proposals are used does the efficiency scale polynomially with t_o. This result can be understood by looking at the landscape of λ_t, as illustrated in Fig. 8. Imagine a flat-histogram simulation on this system for t_o = 4 (black curve in the figure), and analyse what happens in terms of a round-trip. Let us suppose that the simulation was recently at the minimum λ_t (0.41) and that the next round-trip is made by going to the maximum λ_t (1.09) and returning. Let us further suppose that the simulation eventually reached a state with λ_t ≈ 0.92. Because π(x) = π(λ_t(x)), every state at that λ_t is equiprobable, so the state can be in any of the plateaus (of the 4, see figure), proportionally to their size. However, not every plateau contains in its neighbourhood a plateau with higher λ_t; for example, the plateau around 0.3 does not. Therefore, a local proposal would never be able to reach a higher plateau from a state on such a plateau: it would first need to go backward, reach the maximum of P(λ_t) (around 0.69), where the proposal can reach any other state, and then try to find another path towards a higher λ_t. This would already be a problem if the simulation were a canonical ensemble with a β favouring higher λ_t's, since it would require decreasing λ_t by an amount Δλ, which happens with a probability that decays exponentially with Δλ. This is solved by using a flat-histogram ensemble.
However, the crucial challenge is that as t increases the number of local maxima also increases, while the number of maxima connected with the global maximum remains constant: the red curve, with t = 6, now contains 11 plateaus with λ_t ≈ 0.87, but only 1 is locally connected to the maximum λ_t, the one around 0. This means that, as t increases, it becomes more difficult to perform a round-trip: not only because the expected diffusion time increases (see Eq. 28), but also because the simulation more often diffuses forward and backward before it reaches the global maximum. The shift proposal alleviates this problem by allowing non-local moves in the phase space: a shift proposes a state x′ outside the neighbourhood of x, which improves the probability of reaching higher or lower λ's. This explains why the combined proposal has such a low round-trip time in this system.
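The mixture of the two proposals can be sketched generically: if each component kernel satisfies detailed balance with respect to π on its own and the mixing probabilities do not depend on the state, the mixture also preserves π. The names and the toy target below are illustrative:

```python
import numpy as np

def mixed_step(x, kernels, weights, rng):
    """One step of a mixture of Markov kernels (e.g. neighbourhood + shift).
    Each kernel(x, rng) performs a full propose-accept step that preserves
    the target pi; the mixture then also preserves pi."""
    i = rng.choice(len(kernels), p=weights)
    return kernels[i](x, rng)

# Illustrative usage on the circle [0, 1), targeting the uniform distribution
# (both kernels are symmetric, so every move is accepted): a slow local move
# plus a "shift"-like non-local jump that speeds up mixing.
rng = np.random.default_rng(4)
local = lambda x, r: (x + r.normal(0.0, 0.01)) % 1.0
jump = lambda x, r: (x + 0.5) % 1.0
x = 0.0
visits = []
for _ in range(10_000):
    x = mixed_step(x, [local, jump], [0.5, 0.5], rng)
    visits.append(x)
visits = np.array(visits)
```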
Finally, we confirm that a polynomial efficiency is obtained in the computation of P(λ_{t_o}) more generally. We performed flat-histogram simulations, using the Wang-Landau algorithm to estimate P(λ_{t_o}), on different chaotic systems: the tent map, the logistic map, and the standard map (see Appendix VII A for details). The results are shown in Fig. 9 and confirm the dramatic improvement and generality of using Metropolis-Hastings to sample rare states. The simulations in Fig. 9 use t′(x) = t_o − 1 instead of the value given by Eq. 54. This is computationally always more expensive because the correlations due to the neighbourhood proposal are maximal (see the discussion after Eq. 54), but on the other hand this choice is simpler because it does not require estimating d log P/dE and λ_L.

FIG. 9. For the standard map [5], K = 8 and U(x) = const. were used in every case. The Wang-Landau algorithm was used to estimate the distribution prior to performing the flat-histogram simulation, and the estimated distribution agrees with the analytical one when available [37,38]. The proposal distribution was a mixed proposal composed of a 50% chance of the neighbourhood proposal with t′ = t_o − 1 and Δ = 1, and a 50% chance of the shift proposal with t_shift = 1. Adapted from Ref. [46].

B. Transient chaos
The derivation of Eq. 61 uses the assumption that, when x′ is proposed with a scale given by Eq. 43 with t′ = t_e, E[t_e(x′)|x] ≈ t_e(x), as per Eq. 57. We tested this assumption in a way similar to the test of the approximation of Eq. 46 for the FTLE: uniformly draw states x_i, compute their escape times t_e ≡ t_e(x_i) and, for each, generate a state x′_i = x_i + δ(x) with scale δ_x(x) given by Eq. 43 with t′(x) = t_e(x), and compute its escape time t′_e ≡ t_e(x′_i). The assumption is valid when, on average, t′_e − t_e ≪ t_e for large t_e. We performed this experiment in the following systems: the open tent map with a = 3 and b = 5, the standard map with K = 6, and coupled Hénon maps with D = 2, 4, 6, 8 (see Appendix VII A), all with Δ = 1. We observed that in all 6 cases the average t′_e − t_e is smaller than 1 for all t_e > 2/κ. These observations show that the assumption behind Eq. 57 is valid for a broad class of chaotic systems.
A second assumption tested here is the self-similarity argument used in deriving Eq. 58: the landscape is self-similar, so that irrespective of the particular scale we choose (by decreasing t′), the properties of the zoomed phase space remain the same. In particular, we are interested in checking that t′_e − t_e does not depend on t_e when we choose a larger scale, i.e. when t′ is decreased by a constant. We thus repeat the experiment above but decrease t′ to t′(x) = t_e(x) − T with T = 0, 1/κ, 5/κ, to check that increasing T decreases the average t′_e − t_e without changing its independence from t_e. Our numerical tests in the same systems as before show that t′_e − t_e decreases with increasing T and remains independent of t_e, confirming our hypothesis.
The previous tests indicate that the proposal distribution should induce a constant acceptance rate when the Lyapunov exponent of the system is used in Eq. 74. To confirm that this is the case, we performed a flat-histogram simulation with an isotropic proposal distribution of width δ_x(x) = δ_x(t_e(x)) given by Eq. 74 with λ_L(t_e) = λ_L. The results, Fig. 10, reproduced from Ref. [45], confirm that proposing with λ_L guarantees a constant acceptance, and that any other exponent in Eq. 74 fails to do so.
The above tests confirm that the derivation made in Sec. IV C 2 holds for a paradigmatic strongly chaotic open system. These tests also highlight a major advantage of the approach in this paper: it allows testing the assumptions made at each step, something that other approaches, such as the ones in Refs. [12,17,18], do not explicitly allow.
The efficiency of the simulation is tested in Fig. 11, which shows the round-trip time τ(t_max) as a function of the maximal escape time considered, t_max, in the generic coupled Hénon maps defined by Eq. 69. It confirms the dramatic improvement of Metropolis-Hastings with the proposal derived in section II B over uniform sampling: the scaling is polynomial using importance sampling, and exponential in uniform sampling.

FIG. 10. The acceptance rate of a Monte Carlo flat-histogram simulation is constant as a function of the escape time t_e when the Lyapunov exponent of the system, λ_L, is used. The simulation was made on the open tent map with a = 3 and b = 5 with the exact P(t_e) given by Eq. 67. Different curves represent the proposal with δ_x(x) given by Eq. 77 with three different exponents. When the exponent is larger than λ_L, δ_x(x) effectively aims for a smaller t and therefore a larger distance t'_e(x) − t_e(x), consequently decreasing the acceptance. When the exponent is smaller than λ_L, δ_x(x) aims for a larger t than the one given by Eq. 61, and therefore t'_e(x) − t_e(x) decreases to 0 as t_e(x) → ∞, and the acceptance converges to 1. Adapted from [45].

FIG. 11. Polynomial scaling of the number of samples required to perform a round-trip (1 → t_max → 1) as a function of t_max for the Metropolis-Hastings algorithm with the proposal derived in Sec. IV C 2, as opposed to the exponential scaling of uniform sampling. This plot represents the average round-trip time of a flat-histogram simulation with the proposal given in Sec. IV D, and the number of samples required to sample t_max using uniform sampling (line), in the coupled Hénon map, Eq. 69, for different dimensions. The two full lines represent t_max^2 (lower) and t_max^3 (upper), and the dashed line represents 1/P(t_e) for D = 4 (e.g. from Fig. 2; D > 4 have an even higher exponent). The flat histogram was obtained by first running a Wang-Landau algorithm for 10 refinement steps, each with 100 round-trips. Each point represents the average round-trip time over 100 round-trips after the 10 refinement steps. The proposal distribution used was the isotropic one, Eq. 41, with δ_x(x) given by Eq. 77 with δ_0 = 10.
The derivation in Sec. IV C 2, the tests presented above, and the results in Fig. 11 show why and how importance-sampling Metropolis-Hastings can efficiently sample long-living trajectories in strongly chaotic open systems. The proposal distribution should be applicable to strongly chaotic open systems more generally, since the approximations made are expected to be valid in other strongly chaotic systems.

A. Summary of results
We have introduced a framework to sample rare trajectories in different classes of chaotic systems. It is based on the Metropolis-Hastings algorithm, a flexible and well-established Monte Carlo method suitable for the investigation of many numerical problems in chaotic dynamical systems (as shown in Sec. II). Our main contribution is a procedure (see Sec. IV) to construct the proposal step of the Metropolis-Hastings algorithm (which proposes a new state x' given the current state x) that ensures the efficiency of the sampling. The main arguments in the construction of this procedure are: (i) set (in Sec. IV A) as a heuristic goal a bounded (or constant) acceptance rate (23); this generalizes the traditional heuristic [60] used in Metropolis-Hastings, E(x') − E(x) ∼ 1, and can therefore be used in Metropolis-Hastings simulations more generally [72]; (ii) introduce (in Sec. IV B) an auxiliary quantity, the correlation time t, that quantifies the similarity between any two states. It is motivated by the notion that the observables considered in Sec. II C are computed over trajectories, and t(x, x') quantifies the time during which trajectories remain close to each other. We then showed how two proposal distributions, shift and neighbourhood, can be used to control the average of t(x, x') over x', t(x). A successful application of our framework leads to an algorithm in which the number of samples needed to obtain an independent rare sample scales polynomially with the difficulty of the problem, as opposed to the exponential increase observed in traditional uniform sampling (see Figs. 9 and 11).

B. Comparison to previous results
The importance of the proposal distribution has long been emphasized for Monte Carlo methods [8,63] and for sampling chaotic systems [12,14-16]. The questions that remain open from these works, and that we tackle in our paper, are how the efficiency of the sampling method is related to the proposal distribution and how an efficient proposal can be constructed from assumptions about the system. The proposals derived in this paper can be mapped, under appropriate simplifications, to known results from the literature, specifically to the proposals derived in Refs. [45,46] and the stagger part of the algorithm presented in Ref. [12]. We make this connection explicit in appendix VII C. Another example of an algorithm that can be directly analyzed within the framework developed here is the precision shooting proposed in Refs. [22,73] and used in numerous applications of transition path sampling to chemical reactions [73-76]. The precision shooting method proposes a state x' isotropically distanced from x by δ_x(x) given by Eq. 43, with t(x) = t_o, where t_o is the length of the trajectory. From our results we conclude that this proposal is sub-optimal because it over-correlates the proposed state. Another application of the results of section IV is to the Lyapunov Weighted Dynamics algorithm of Refs. [16,18], which uses a population Monte Carlo algorithm. As mentioned in these references (see also Ref. [20]), there is a parameter ε that controls how far new clones x' should be placed from the existing clone x, and it should be neither too small nor too large. This parameter plays a role similar to δ_x(x) in section IV, and an optimal ε should therefore be related to our results.
In comparison to previous sampling methods in dynamical systems, including those mentioned above and others (e.g., Ref. [12]), the distinguishing feature of our results is that they provide an explicit connection between the proposal distribution and the acceptance rate. This connection, which is typically absent in Monte Carlo methods more generally, is extremely powerful because failures of the algorithm can be related to violations of the hypotheses (about the method and the dynamical system) used in our derivations. Such violations should then be understood, and this understanding can be fed back into the methodology to generate new methods adapted to that situation. We hope this process will extend the range of applicability of our framework to other classes of dynamical systems (e.g., non-hyperbolic systems [31,42]) and observables E. Possible improvements of our results can be obtained by considering anisotropic search domains, an idea that has been shown to be essential for finding chaotic saddles in systems with more than one positive Lyapunov exponent [43].

C. Discussion
The proposal distribution is a way of moving stochastically through phase space, and how to select a new state x' from a given state x is a general problem across numerical techniques. Some of the most successful numerical algorithms in the literature, such as golden-section search or gradient descent, are essentially generic and efficient ways of selecting a new state. The results in Figs. 7 and 9 show that the proposal distribution strongly influences the computational cost of the different procedures, often irrespective of the particular sampling procedure (canonical or flat-histogram) and problem (sampling or finding). This reinforces the notion that the proposal of the new tentative state from the current state is a crucial factor when developing numerical techniques for optimization and numerical integration. We can therefore expect the main insights of this paper, e.g. the direct connection between fundamental properties of chaotic systems and the optimal proposal distribution, to be useful for other problems as well (e.g., the optimization problem in which one is interested in maximizing or minimizing the observable).
The development of a numerical algorithm requires a compromise between how fast it solves a particular problem and how well it generalizes to different problems. One interesting aspect of the algorithms (i.e. proposal distributions) introduced in this paper is that even though they can be made very specific (e.g. propose with the FTLE of the trajectory, Eq. 43), they can also be made more general (e.g. propose using the Lyapunov exponent of the system, or the power-law proposal distribution, which does not use any specific information about the system). That is, more specificity requires more information (the FTLE of the trajectory) and makes the algorithm more efficient, while less information (only the Lyapunov exponent of the system) makes the algorithm less specific but also less efficient. This capability shows that the methodology is useful not only to study a particular system about which some information is known, but also in situations in which less is known about the system. This does not mean that the methods apply to all problems, as there are important classes of chaotic systems in which some of the assumptions used in section IV are violated. For example, in non-hyperbolic systems [31,42] trajectories may remain correlated for a long time, which implies that one cannot assume that after t the trajectories are independent. Nevertheless, because the framework was outlined in the form of adding known information about the system, improved insights about a class of chaotic systems can likely be translated into a faster algorithm.
One advantage of the methodology presented here is that it is not restricted to specific observables E(x). In principle, it can be used to construct proposal distributions to sample rare states of different observables E, e.g. E(x) = t λ_t and E(x) = t_e(x). While it remains unclear what is the precise class of observables for which our methodology allows constructing an efficient proposal distribution, the different derivations of the proposal distribution do provide insights on the observables for which a proposal distribution could be constructed from properties of the system. As argued at the end of Sec. IV C 1, observables computed as an average along the trajectories are similar to the FTLE, and therefore the proposal distribution derived in Eq. 56 should lead to efficient algorithms in these cases.
Altogether, our results reveal a fascinating interplay between the chaotic nature of some non-linear systems and the numerical techniques available to study rare events. They reinforce the idea that an efficient proposal requires information about the system (e.g., our derivation of the proposal distribution used the facts that "trajectories diverge exponentially" and that "the escape time function is a fractal-like function"). The analysis of this interplay allows us to better understand both these systems and these numerical techniques. This understanding opens perspectives to develop better techniques to numerically study rare trajectories and extreme events in non-linear systems more generally.

A. Appendix 1: Dynamical Systems
In this appendix we describe the different dynamical systems that we use throughout the paper.

Skewed Tent Map
A paradigmatic example of a strongly chaotic system is the tent map [5], defined on Ω = [0, 1] by

F(x) = a x for x ≤ 1/a, and F(x) = b(1 − x) for x > 1/a,

where a > 1 is a constant and b ≡ a/(a − 1). This map contains the main features of a chaotic system: it has a positive Lyapunov exponent, λ_L = (a log b + b log a)/(a + b), and a smooth invariant measure (the Lebesgue measure is invariant). The finite-time Lyapunov exponent is given by

λ_t(x) = [i(x) log a + (t − i(x)) log b]/t,

where i(x) is the number of times x_t ∈ [0, 1/a]. The distribution of the finite-time Lyapunov exponent for U(x) = 1, P(E) = G(E), can be computed analytically and is a binomial,

G(λ_t) = C(t, i) (1/a)^i (1/b)^(t−i),

where C(t, i) is the binomial coefficient and i is fixed by λ_t through the relation above.
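The binomial form of G can be checked directly by uniform sampling. The sketch below (a = 3, hence b = 1.5 and left-branch probability 1/a; all parameters illustrative) compares the empirical distribution of i(x) with Binomial(t, 1/a):

```python
import math, random

A = 3.0
B = A / (A - 1.0)        # closed tent map: b = a/(a-1)
T = 10                   # trajectory length
N = 100_000              # number of uniformly drawn initial conditions

def left_visits(x):
    """i(x): number of visits to [0, 1/a] along a trajectory of length T."""
    i = 0
    for _ in range(T):
        if x < 1.0 / A:
            i += 1
            x = A * x
        else:
            x = B * (1.0 - x)
    return i

random.seed(3)
counts = [0] * (T + 1)
for _ in range(N):
    counts[left_visits(random.random())] += 1

max_err = 0.0
for i in range(T + 1):
    # binomial weight: C(T, i) (1/a)^i (1/b)^(T-i)
    pred = math.comb(T, i) * (1 / A) ** i * (1 / B) ** (T - i)
    max_err = max(max_err, abs(counts[i] / N - pred))
print("max deviation:", max_err)
```

Because the two linear branches map their intervals onto [0, 1], the symbol sequence is i.i.d. under the uniform measure, which is why the binomial prediction is exact.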

Standard map
The standard map considered here is defined by x ≡ (p, θ) ∈ Ω = [0, 1] × [0, 1], evolving in time according to

p_{t+1} = p_t + (K/2π) sin(2π θ_t) mod 1,
θ_{t+1} = θ_t + p_{t+1} mod 1.

We focus on the parameter K = 6, which leads to a phase space with no visible KAM islands. We also consider the leaked (open) version of this map, obtained by introducing a hole into the system at Λ = [0.1, 0.1].
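A sketch of this map, together with a tangent-space estimate of its Lyapunov exponent (the normalization p' = p + (K/2π) sin(2πθ), θ' = θ + p', both mod 1, is the common unit-square convention and is assumed here; the initial condition is arbitrary):

```python
import math

K = 6.0

def step(p, theta):
    """One iteration of the standard map on the unit square."""
    p = (p + K / (2 * math.pi) * math.sin(2 * math.pi * theta)) % 1.0
    theta = (theta + p) % 1.0
    return p, theta

def lyapunov(p, theta, n=20000):
    """Largest Lyapunov exponent from the growth of a tangent vector."""
    dp, dth = 1.0, 0.0
    s = 0.0
    for _ in range(n):
        c = K * math.cos(2 * math.pi * theta)       # d p'/d theta
        # Jacobian: dp' = dp + c*dth ; dth' = dp + (1 + c)*dth
        dp, dth = dp + c * dth, dp + (1.0 + c) * dth
        p, theta = step(p, theta)
        norm = math.hypot(dp, dth)
        s += math.log(norm)
        dp, dth = dp / norm, dth / norm             # renormalize
    return s / n

lam = lyapunov(0.3, 0.2)
print("lambda_L ~", lam)   # Chirikov's estimate for large K is log(K/2)
```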

Skewed Open Tent Map
The paradigmatic example of a strongly chaotic open system is the open tent map [11], defined on Ω = [0, 1] by

F(x) = a x for x ≤ b/(a + b), and F(x) = b(1 − x) for x > b/(a + b),

FIG. 12. The open tent map and its corresponding surviving set, equivalent to the construction of the Cantor set. (Left) The open tent map, where the escape corresponds to states inside the interval in the middle, which maps outside the unit interval. (Right) An iteration of the open tent map with a = b = 3 corresponds to removing the middle third of each of the plateaus of the surviving set at time t; the set that survives this removal is the surviving set at t + 1. The middle-third Cantor set is the surviving set at t → ∞.
FIG. 13. The escape time function, t_e(x), of the tent map for generic a and b. There are 2^t intervals, and the size of each plateau can be computed analytically from the one at the previous time. Specifically, a given interval has size ε(x) = a^{−i(x)} b^{−(t_e(x)−i(x))}, where i(x) is the number of times 0 < F^t(x) < 1/a, for t = 1, ..., t_e(x). The crucial observation is that λ_{t_e(x)}(x) is proportional to log(ε(x))/t_e(x). Adapted from Ref. [45].
where a > 1 and b > a/(a − 1). The state exits the system when it leaves the unit interval, i.e. Λ = R − Ω. This map contains the main features of an open chaotic system: it has a positive Lyapunov exponent, λ_L = (a log b + b log a)/(a + b), an exponential decay of the escape time distribution, with κ = −log(1/a + 1/b), and a conditionally invariant measure that is fractal, with a (non-integer) fractal dimension D_0 given implicitly by

(1/a)^{D_0} + (1/b)^{D_0} = 1.
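The escape rate κ = −log(1/a + 1/b) can be recovered numerically: the fraction of uniformly drawn states surviving t steps is exactly (1/a + 1/b)^t, so a fit of the survival counts gives κ. A sketch (a = 3, b = 5, illustrative sample sizes):

```python
import math, random

A, B = 3.0, 5.0
PEAK = B / (A + B)                       # branch point of the open tent map
KAPPA = -math.log(1.0 / A + 1.0 / B)    # exact escape rate

def escape_time(x, t_max=100):
    """Iterations until the trajectory leaves the unit interval."""
    for t in range(t_max):
        if not 0.0 <= x <= 1.0:
            return t
        x = A * x if x <= PEAK else B * (1.0 - x)
    return t_max

random.seed(4)
N = 200_000
times = [escape_time(random.random()) for _ in range(N)]
surv = lambda t: sum(1 for te in times if te > t)   # survival count
# exponential decay: log of the survival ratio over 5 steps gives kappa
kappa_est = math.log(surv(3) / surv(8)) / 5.0
print(kappa_est, "vs exact", KAPPA)
```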

Coupled Open Hénon Map
As a generic example of a high-dimensional strongly chaotic open system, we consider a ring of d/2 coupled Hénon maps, defined by a state x = (x_1, y_1, ..., x_{d/2}, y_{d/2}) ∈ Ω = R^d, where each individual map (x_i, y_i) evolves according to a Hénon update coupled to its neighbouring maps on the ring.

B. Appendix 2: Acceptance rate of the uniform proposal

Here we show that the choice of sampling distribution does not necessarily decrease the scaling of the variance of the estimator. Our goal is to compute the average acceptance rate for a given escape time t_e in the canonical ensemble with a uniform proposal distribution, g(x'|x) = 1/|Γ|. The acceptance is given by a(x'|x) = min{1, exp(−β(t_e(x') − t_e(x)))}, where β < 0 is used to reach higher E(x) = t_e(x). The acceptance of a state x is given by

a(x) = ∫ a(x'|x) g(x'|x) dx'.

Because g(x'|x) = 1/|Γ| and π only depends on t_e(x), a(x) does not depend on x, only on t_e: a(x) = a(t_e(x)). The average acceptance rate at a given t_e, A(t_e) ≡ E[a(x)|t_e], is obtained by averaging a(x) over the states with escape time t_e. Because a(x) only depends on t_e, it can be pulled out of the integral, and thus A(t_e) = a(t_e). Taking into account that P(t_e) = κ exp(−t_e κ), the acceptance rate can be computed analytically by integrating Eq. 70, and it decays exponentially with increasing t_e (recall that β < 0). Because low acceptance implies that the random walk stays in the same state for a long time, this leads to an exponential increase of the autocorrelation time T(E) and therefore an increase of the variance in Eq. 22. The same argument applies to a flat-histogram simulation, where π(x) ∝ exp(κ t_e(x)).
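Since P(t_e) is (to a good approximation) geometric for the open tent map, A(t_e) can also be evaluated by direct summation. The sketch below (β = −0.5, a = 3, b = 5, all illustrative values) confirms the exponential decay:

```python
import math

A_, B_ = 3.0, 5.0
kappa = -math.log(1 / A_ + 1 / B_)   # escape rate of the open tent map
beta = -0.5                          # beta < 0 favours large escape times
q = math.exp(-kappa)                 # per-step survival probability (= 8/15)

def P(t):
    """P(t_e = t): geometric escape-time distribution."""
    return (1 - q) * q ** (t - 1)

def acc(te, t_cut=400):
    """Average acceptance A(t_e) of the uniform proposal."""
    return sum(P(tp) * min(1.0, math.exp(-beta * (tp - te)))
               for tp in range(1, t_cut))

a3, a10 = acc(3), acc(10)
print("A(3) =", a3, "A(10) =", a10)
```

The ratio A(3)/A(10) is an order of magnitude, consistent with a decay rate min(κ, |β|) per unit of t_e.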

C. Appendix 3: Simplified proposals
This section presents approximations that can be used to reduce both the implementation effort and the computational cost of the proposals derived in Sec. IV. These approximations often reduce the general proposal derived in Sec. IV to particular proposals already found in the literature, thereby explaining such proposals in this wider context.

Propose with the Lyapunov exponent in open systems
Eq. 43 requires computing λ_{t_e(x)}(x), even though the main interest is in t_e(x). This calculation requires using a numerical algorithm or multiplying a product of matrices [23], both of which have an associated computational cost. A simplification of this proposal is to approximate λ_{t_e(x)}(x) by the maximum of the distribution of FTLEs with finite time t_e, λ_L(t_e): λ_{t_e(x)}(x) ≈ λ_L(t_e(x)).
This approximation is valid as long as λ_{t_e}(x) is not in the tails of the distribution of FTLEs with finite time t_e, P(λ_{t_e}). A sampling distribution that only depends on t_e, π(x) = π(t_e(x)), guarantees that states with the same t_e are equally sampled, P(x|t_e) = U(x), and therefore P(λ_{t_e}(x)|t_e) = P(λ_{t_e}). Furthermore, the approximation of using the maximum of the distribution holds because the tails of P(λ_{t_e}) decay exponentially with increasing t_e (see Sec. II A). Under these approximations, Eq. 43 can be simplified to

δ_x(x) = ∆ exp(−λ_L(t_e(x)) t(x)).   (Eq. 74)

Furthermore, λ_L(t_e) converges to the Lyapunov exponent of the system, λ_L, with increasing t_e. Therefore, a further simplification is to use the Lyapunov exponent of the system instead of λ_L(t_e) in Eq. 74,

δ_x(x) = ∆ exp(−λ_L t(x)).   (Eq. 75)

Moreover, the t(x) we derived in Eq. 61, in the flat-histogram ensemble and with log G(t_e) ∝ −κ t_e, can be written as

t(x) = t_e(x) + a/κ,   (Eq. 76)

where a is a constant. Defining the constant δ_0 ≡ ∆ e^{−a λ_L/κ}, we can write

δ_x(x) = δ_0 exp(−λ_L t_e(x)).   (Eq. 77)

This equation is exactly the proposal derived in Ref. [45], and shows that the proposal derived in Sec. IV C 2 generalises the proposal in Ref. [45].

Adaptively estimate the Lyapunov exponent
Using the proposal distribution with λ_L requires a priori knowledge of it, which is typically not available. This difficulty resembles the problem that flat-histogram simulations face: G(E) is required, but it is typically unknown a priori. This analogy motivates a Monte Carlo procedure that computes, on the fly, a δ_x(t) that scales with λ_L.
Consider a hypothetical simulation with an isotropic proposal distribution (Eq. 41) with δ_x(x) = σ(t_e(x)), where σ(t) is initially set to σ(t) = 1 for every t. Consider also that the simulation has reached a state x with a high escape time (e.g. t_e = t_e(x) = 10/κ). A proposed state, x' = x + ĥσ(t_e), will most likely have a much lower escape time (e.g. t_e(x') = 1/κ). From Eq. 43 and Eq. 61, this indicates that σ(t_e) is much higher than the "correct" proposal scale, δ_x(x), and therefore it should be reduced in the next proposal. The opposite is also true: when σ(t_e) is much smaller than δ_x(x), t_e(x') = t_e(x) and σ(t_e) should be increased. This hypothetical simulation suggests that, in the same spirit as the Wang-Landau algorithm approximates the density P(t_e), it is possible to approximate δ_x(x) with an update scheme inserted in the Metropolis-Hastings algorithm, given by the same algorithm as the Wang-Landau (see sec. III C 0 e) but updating, besides P_WL(t), also σ(t) [45]:

σ(t_e) → σ(t_e) f    if t_e(x') = t_e,
σ(t_e) → σ(t_e)/f    if t_e(x') < t_e.
This update scheme generalises the Wang-Landau procedure to the proposal distribution. It is expected to converge to a function σ(t) that decays exponentially with the Lyapunov exponent of the system, and to a proposal distribution with a constant acceptance rate in a flat-histogram simulation. It was extensively tested in different systems (the tent map, the fully chaotic standard map with a leak, and coupled Hénon maps with different D); see Refs. [43,45,77].

Power-law proposal distribution
The proposal distributions derived in the previous sections require some knowledge about the state and the system: λ_{t_o}(x), t_e(x) (in open systems), the λ_L of the system and, in some situations, G(E(x)). An alternative that avoids computing δ_x(x) is to consider a proposal in which the time t during which the two trajectories remain together is not imposed by a given t(x), but is instead a uniform random variable in [0, t_o] (FTLE) or [0, t_e(x)] (open systems), generated anew at each proposal. Some values of t will be far from the optimal t(x), and the corresponding x' will either be rejected or be too close to x, but other values of t will be close to the optimal t(x) and therefore useful.
Having a uniformly distributed correlation time t still requires computing δ_x(x) in Eq. 43, which requires λ_{t_o}(x). In case λ_t(x) is unknown (e.g. in open systems one may only be interested in the escape time and therefore not compute λ_t(x)), one may further approximate it by a uniform distribution between two extremes. Because the product of two uniformly distributed random variables is also uniformly distributed, this leads to a proposal distribution where δ_x(x) is given by exp(−U(a, b)), where a and b are free parameters. By a standard transformation of variables, this leads to a scale δ_x that is power-law distributed,

P(δ_x) ∝ 1/δ_x for δ_min ≤ δ_x ≤ δ_max,   (Eq. 80)

where δ_min and δ_max ≈ |Γ| are two free parameters. The δ in x' = x + ĥδ follows a half-normal distribution with scale δ_x, but since this scale is now power-law distributed, it is no longer necessary to use the half-normal distribution at all; instead, it is possible to use the power-law proposal distribution directly, where δ is drawn from P(δ_x) in Eq. 80, i.e. |x' − x| is power-law distributed according to Eq. 80. Without the half-normal distribution the proposal distribution no longer depends on x, and therefore g(x'|x)/g(x|x') = 1.
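Drawing the scale with P(δ) ∝ 1/δ amounts to exponentiating a uniform variable, so a histogram over logarithmic bins comes out flat. A sketch (δ_min and δ_max are illustrative):

```python
import math, random

random.seed(7)
d_min, d_max = 1e-6, 1.0
N = 100_000

# draw delta with P(delta) proportional to 1/delta on [d_min, d_max]:
# exponentiate a uniform variable on [log d_min, log d_max]
samples = [math.exp(random.uniform(math.log(d_min), math.log(d_max)))
           for _ in range(N)]

# counts per decade should be flat for a 1/delta density
decades = [0] * 6
for d in samples:
    k = min(int(math.log10(d / d_min)), 5)
    decades[k] += 1
print(decades)
```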
The stagger part of the algorithm of Ref. [12] proposes exactly with a scale given by Eq. 80, and the argumentation above explains why the proposal distribution used in Ref. [12] to find states with high escape time t_e is reported to work well: it is a proposal distribution that proposes x' correlated with x with a correlation time t that is uniformly distributed, and which therefore eventually proposes x' with the optimal correlation t(x). To confirm this explanation, we consider a flat-histogram simulation with a power-law proposal distribution on the open tent map and measure E[log δ_x | A*, t_e], where A* is the condition ε < a(x'|x) < 1 − ε (of bounded acceptance). Under the above argumentation, the scale log δ_x that contributes to a bounded acceptance is given by −λ_{t_e} t_e, as per Eq. 77. This is confirmed by the numerical simulation shown in Fig. 14, and was also obtained for the problem of finding rare states, as reported in Ref. [43]. This result, combined with the derivation of t(x), explains the success of the proposal (the stagger) used in Ref. [12] from basic notions of chaotic systems and numerical methods.