Importance sampling of rare events in chaotic systems

Finding and sampling rare trajectories in dynamical systems is a difficult computational task underlying numerous problems and applications. In this paper we show how to construct Metropolis-Hastings Monte Carlo methods that can efficiently sample rare trajectories in the (extremely rough) phase space of chaotic systems. As examples of our general framework we compute the distribution of finite-time Lyapunov exponents (in different chaotic maps) and the distribution of escape times (in transient-chaos problems). Our methods sample exponentially rare states in a polynomial number of samples (in both low- and high-dimensional systems). Open-source software that implements our algorithms and reproduces our results can be found in reference [J. Leitao, A library to sample chaotic systems, 2017, https://github.com/jorgecarleitao/chaospp].


Introduction
Extreme events play a crucial role in our society. Landslides, floods, meteorite collisions, solar flares, and earthquakes are all rare events that often have catastrophic consequences for our well-being [1]. Science often studies extreme events by recreating the process that generates them sufficiently many times. While in some cases the process can be reproduced experimentally, in many others (e.g., astronomical and weather events) the best we can do is to simulate physical models.
Physical models often contain non-linearities in the equations of motion that amplify fluctuations and give rise to extreme events. A famous effect of non-linearity is chaos, the extremely sensitive dependence on initial conditions. Chaos has a fascinating history dating back to the seminal work of Poincaré in 1890 [2] and is today a well-established field of research with applications in Biology, Geology, Economics, Chemistry, and Physics [3,4]. Chaotic dynamics often hinders our ability to study the evolution of the system analytically, e.g. it forbids describing trajectories in a closed formula. In this situation, again, often the best we can do is to numerically simulate the model on a computer, a paradigm popularized by Lorenz since the 1960s [3,5]. Extreme events are then studied statistically, over an ensemble of initial conditions.
In this paper we introduce a framework for performing numerical simulations in chaotic systems in such a way that rare trajectories are generated more often than they would be by chance. This is an importance-sampling [6,7] (or rare-event-simulation [8]) strategy that builds on methods that have proven successful in the characterization of rare configurations in various problems. The distinguishing feature of our framework is that it considers general classes of observables [9] in deterministic chaotic [4,10] systems, and it is therefore able to find and sample initial conditions leading to extreme events in different problems. We apply our framework to two traditional problems in the numerical exploration of chaotic systems: trajectories with high or low finite-time Lyapunov exponents, and long-living trajectories in open systems.
a e-mail: eduardo.altmann@sydney.edu.au
The results shown in this paper are general, but simulations are done in simple dynamical systems (time-discrete maps with up to 16 dimensions). Our goal is to illustrate the generic computational challenges of sampling rare events in chaotic systems, in line with the tradition that simple systems often possess the basic mechanisms responsible for extreme events.
The importance of rare trajectories in chaotic systems, including methods designed to find and sample them, has been studied from different perspectives [9-20]. For example, the stagger-and-dagger method, used to find long-living trajectories of transiently chaotic systems, has been an important tool to characterize chaotic saddles [10,11]. Lyapunov weighted dynamics is a population Monte Carlo method that was successfully applied to find trajectories with atypically low or high chaoticity [16-18,21,22]. Similarly, genealogical particle analysis was applied to compute rare events in simple climate models [20]. Another powerful and widely used method within trajectory sampling is transition path sampling [13,14], which has been used to sample rare trajectories (e.g. chemical reactions), typically influenced by thermal noise [14]. These are typically trajectories that transit from one stable configuration to another stable configuration [14,23]. These different methods achieve their goal through different, often ingenious, solutions. Here we aim to construct methods that are applicable to different classes of problems and to quantitatively understand the impact of different parameters and choices on the efficiency of the algorithm. We then show that some of the existing methods can be derived from our construction under suitable approximations.
Our methods are based on Metropolis-Hastings (MH) importance sampling [6,7], a well-established numerical technique that has been used in Statistical Physics to study rare events since the 1950s. MH produces a random walk x → x′ in the phase space of the system that generates initial conditions leading to extreme events (rare states) more often than they would be found by chance, consequently reducing the computational cost associated with their rareness. The flexibility of MH is confirmed by its success in numerous fields in Physics, Chemistry, and Finance, among many others [6,7]. While transition path sampling [14], Lyapunov weighted dynamics [16], and genealogical methods [20] already use importance sampling in deterministic chaotic systems, three fundamental questions remain largely open: (1) can MH be used to efficiently and systematically sample rare trajectories of (deterministic) chaotic systems? If yes, (2) how, and (3) at what (computational) cost?
We answer these questions with a framework that constructs MH algorithms to efficiently sample rare events in chaotic systems. The crucial step our framework solves is to derive a proposal distribution that bounds the MH acceptance rate for a broad class of chaotic systems and observables (e.g. maximal Lyapunov exponent, escape time). More specifically, we show how to incorporate properties of trajectories of the dynamical system in the MH proposal distribution, and that this is a necessary condition to obtain efficient algorithms. This allows us to efficiently sample rare events in different problems and classes of chaotic systems. Proposal distributions existing in the literature [11,23-25] can be derived as limiting cases of our general framework. We expect the ideas and formalism presented here to find applications in other problems of chaotic systems and in the study of extreme events more generally. Therefore, we expect our results to be useful both to those studying extreme events in non-linear systems and to those studying numerical techniques.
The paper is organized as follows: Section 2 reviews traditional numerical problems in chaotic systems and shows how they can be formulated as a problem of sampling rare trajectories; Section 3 introduces the MH algorithm as a method to perform importance-sampling simulations in chaotic systems, and shows that naive approaches do not lead to an efficient MH; Section 4 shows how to incorporate general features of chaotic systems, such as exponential divergence or self-similarity of some of its properties, in the proposal distribution in order to achieve an efficient MH; Section 5 presents numerical tests of the general framework that confirm the applicability of Monte Carlo algorithms to sample rare events in different classes of chaotic systems; Section 6 summarizes our results and discusses their implications.

Review of problems
We consider dynamical systems whose states x in a phase space Ω, x ∈ Ω ⊂ R^D, evolve in time from an initial condition x = x_0 according to

x_{t+1} = F(x_t),   (1)

where F(x) ∈ Ω and F^t is F composed t times, F^t(x) ≡ F(F(...F(x)...)). Such discrete-time systems can be obtained from continuous-time systems through a Poincaré surface of section, a stroboscopic projection, or by the time discretization of the numerical integration of differential equations. We are also interested in the dynamics in the tangent space, which quantifies the divergence between two initial conditions [4]. Specifically, the distance of a state x′ displaced from x by h, x′ = x + h, to the original state x evolves in time according to

x′_t − x_t = F^t(x + h) − F^t(x).   (2)

Expanding F^t(x′) around F^t(x) allows x′_t − x_t to be written as

x′_t − x_t = J_t(x) h + O(|h|²),   (3)

where J_t(x) ≡ dF^t(x)/dx is the Jacobian matrix of F^t, and the second-order term is controlled by ∂²F^t(x)/(∂x_i ∂x_j), the (i, j) entry of the Hessian matrix of F^t. The first term of equation (3) can be expanded using the derivative of the composition and written as

J_t(x) = J(x_{t−1}) J(x_{t−2}) ... J(x_0),   (4)

where J ≡ J_1 and h_t evolves in the tangent space according to

h_{t+1} = J(x_t) h_t.   (5)

For small |h|, the growth of |x′_t − x_t| is characterized by the eigenvalues and eigenvectors of J_t [26]. The largest finite-time Lyapunov exponent (FTLE) of a point x can be defined as

λ_t(x) = (1/t) ln μ_1,   (6)

where μ_1 is the real part of the largest eigenvalue of J_t. Thus, when at least one direction is unstable (μ_1 > 1), |x′_t − x_t| increases exponentially with time, at most by

|x′_t − x_t| ≤ e^{t λ_t(x)} |h|.   (7)

When the system is one dimensional, the "Jacobian matrix" is a single number, the product in equation (4) is a product of numbers, and the only "eigenvalue" is the result of this product. Thus, in this case equation (6) can be written as

λ_t(x) = (1/t) Σ_{i=0}^{t−1} ln |F′(x_i)|.   (8)

We now describe two computational problems in chaotic dynamical systems.
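For a one-dimensional map, equation (8) reduces the FTLE to an average of log-derivatives along the trajectory. A minimal sketch in Python, using the fully chaotic logistic map F(x) = 4x(1 − x) as an illustrative choice (it is not one of the maps in Appendix A), whose Lyapunov exponent is ln 2:

```python
import math

def logistic(x):
    # One iteration of the (illustrative) logistic map F(x) = 4 x (1 - x).
    return 4.0 * x * (1.0 - x)

def logistic_deriv(x):
    # F'(x) = 4 (1 - 2 x).
    return 4.0 * (1.0 - 2.0 * x)

def ftle(x0, t):
    # Equation (8): lambda_t(x0) = (1/t) * sum_{i=0}^{t-1} ln |F'(x_i)|.
    x, acc = x0, 0.0
    for _ in range(t):
        acc += math.log(abs(logistic_deriv(x)))
        x = logistic(x)
    return acc / t
```

For this map ftle(x0, t) fluctuates around ln 2 ≈ 0.693; the fluctuations across initial conditions x0, of order 1/√t, are exactly the distribution P(λ_t) discussed next.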

Variability of trajectories' chaoticity
For a chaotic system, λ_L ≡ λ_{t→∞} > 0 in equation (6). The variation of the (maximum) finite-time Lyapunov exponent λ_t (FTLE) across different initial conditions characterizes the variation of chaoticity of the system, yielding a distribution of the FTLE, P(λ_{t_o}), or, equivalently, the distribution of E ≡ t_o λ_{t_o}:

P(E) = ∫_Ω U(x) δ(E − t_o λ_{t_o}(x)) dx,   (9)

where U(x) is a chosen probability distribution (e.g. uniform in the phase space). The FTLE and its distribution were introduced in the 1980s [27-29] and have been used to study turbulent flows [30], Hamiltonian dynamics [31], and chimera states [32], and to characterize dynamical trapping [27,33-36], among others [37,38]. The distribution of the FTLE is related to the generalized dimensions [39] and often follows a large deviation principle, where t_o is the extensive parameter [19,39]. In strongly chaotic systems the distribution of the FTLE is Gaussian [39], whereas for intermittent chaos and other weakly chaotic systems the distribution is typically non-Gaussian [40]. Figure 1 shows the phase-space dependency and distribution of the FTLE in one chaotic system. It contains characteristic features observed in strongly chaotic systems: λ_t is a quantity that depends sensitively on the state x, and its distribution P(λ_t) decays exponentially to zero on both sides. In the limit t_o → ∞, P(λ_{t_o}) → δ(λ_{t_o} − λ_L).
The tails of the distribution P(E) play a significant role in the characterization of chaotic systems: for example, the higher moments of the distribution are related to higher qs in the generalized dimensions D_q of the attractor [39]. Furthermore, the regions of the phase space with small (large) finite-time Lyapunov exponents are associated with slow (fast) decay of correlations [29], and their characterization has been used to gain insight into whether the system is ergodic or not [31]. Moreover, trajectories characterized by a low or high finite-time Lyapunov exponent can play a significant role in the dynamics of interfaces in chaotic flows [30] and in other settings [16].
A typical analysis of the FTLE is to measure how a quantity, W(x), depends on the FTLE λ_{t_o} [29-31]. This requires estimating an integral of the form

W(λ_{t_o}) = ∫_Γ W(x) U(x) δ(λ_{t_o} − λ_{t_o}(x)) dx,   (10)

where W(x) is the pre-selected quantity, Γ is a pre-selected sampling region (often Γ = Ω), and U(x) is the weight attributed to x (often uniform, U(x) = 1/|Γ|). Let us look at two examples. First, consider the problem of estimating the distribution P(λ_{t_o}), e.g. references [16,29,31,40]. The traditional technique is to sample M states x according to U(x) = 1/|Γ| and estimate P(λ_{t_o}) using the estimator M_λ/M, where M_λ is the number of samples with λ_{t_o}(x) ∈ [λ_{t_o}, λ_{t_o} + Δλ_{t_o}]. Formally, this corresponds to W(λ_{t_o}) with W(x) = 1. The second example is retrieved from reference [29]. There, in order to evaluate the contribution of the algebraic region of the phase space to the power spectra, the authors decomposed it in two terms corresponding to the power spectra of trajectories with low FTLE and trajectories with high FTLE. This required estimating the power spectra from a set of trajectories conditioned to an interval of λs on the tails of the distribution. Associating the power spectrum of the trajectory starting at x, S(f; x), with the quantity W(x),

W(x) = S(f; x),   (11)

where f is the frequency, the power spectra represented in Figure 4 of reference [29] correspond to integrals (for different frequencies f) of the form of equation (10), restricted to λ_{t_o} on the tails of the distribution.
Numerically estimating the integral in equation (10) is challenging for two reasons. First, P(λ_{t_o}) decays with t_o (lower panel of Fig. 1): the distribution of the FTLE often follows a large deviation principle, P(λ_{t_o}(x)) ∝ exp(t_o s(λ_{t_o})), where s is intensive with respect to t_o and is often concave [39]. Consequently, the traditional methodology of sampling states uniformly to find or sample states with increasing t_o requires an exponentially large number of initial conditions. Second, the dependency of λ_{t_o}(x) on x has multiple local minima and maxima (upper panel of Fig. 1). Such rough (fractal [41]) landscapes are known to challenge numerical techniques (e.g., simulations get trapped in local minima or maxima) [42].
The problem of finding and sampling states with high or low λ has been addressed in the literature with numerical techniques that go beyond traditional uniform sampling [16,18,19]. Such techniques have been successfully applied to find [16] and sample [18,19] states with extremal λ in different chaotic systems. References [16,19] use a population Monte Carlo canonical ensemble, in which λ_t plays the role of the energy E, to find or sample states with high or low FTLE. The method computes stochastic trajectories that, in the numerical tests performed, are indistinguishable from (deterministic) trajectories of the system. Reference [18] proposes a flat-histogram simulation to find states of high or low chaoticity by developing an observable that quantifies the chaoticity of a state.

Transient chaos
The best known examples of chaotic systems have a fractal attractor in the phase space [4], the Lorenz attractor being the most prominent example [5]. However, chaotic dynamics can appear also when the fractal invariant set in the phase space is not purely attracting, e.g. it may have stable and unstable directions (a saddle). Trajectories close to this set perform chaotic motion for an arbitrarily long (yet finite) time. This phenomenon appears in a variety of physical systems and is known as transient chaos [43,44].
Numerical investigations of transiently-chaotic systems are computationally difficult because most trajectories quickly escape the vicinity of the chaotic saddle (on which the chaotic dynamics is properly defined) [11,12,45].

[Figure 2. Upper panel: the escape time t_e(x), which varies sensitively with the state x, shows an intricate landscape with multiple local and global maxima. The map is the 4-dimensional coupled Hénon map (defined in Appendix A), and the two dimensions are a surface of section on the plane x_2 = 0, y_2 = 0. Lower panel: the exponential decay of the distribution of escape times of (1) the coupled Hénon map with D = 4 and (2) the standard map, both defined in Appendix A. P(t_e = t) was computed by uniformly drawing 10^6 states x ∈ Γ and measuring the relative number of times that t_e(x) = t_e.]
More precisely, the escape time of a state x in a pre-selected region x ∈ Γ ⊂ Ω is defined as the first passage time of its trajectory to a pre-selected exit set Λ ⊂ Ω:

t_e(x) = min {t : F^t(x) ∈ Λ}.   (12)

Almost all trajectories start at Γ and eventually leave when x_t ∈ Λ, but they do so at different times t_e. Such variability is quantified by the distribution of escape times, the probability that a random initial condition x leaves at a given time t_e,

P(t_e) = ∫_Γ U(x) δ_{t_e, t_e(x)} dx,   (13)

where δ_{t_e, t_e(x)} is the Kronecker delta and U(x) is the probability density assigned to each state in Γ, which is often constant, U(x) = 1/∫_Γ dx ≡ 1/|Γ|. Figure 2 shows the typical features of t_e(x): it strongly depends on the initial state x, and its distribution P(t_e) decays exponentially to zero (i.e., it has a constant escape rate κ).
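The escape-time distribution of equation (13) can be estimated directly by uniform sampling. A sketch below assumes an open tent map with slope a = 3 (the map is defined in Appendix A.3; the parameter value is our illustrative choice), for which the surviving region loses a fraction 1 − 2/a = 1/3 of its measure per step, so P(t_e = t) = (1/3)(2/3)^{t−1}:

```python
import random
from collections import Counter

def tent(x, a=3.0):
    # Open tent map (cf. Appendix A.3): F(x) = a*x for x < 1/2, a*(1-x) otherwise.
    return a * x if x < 0.5 else a * (1.0 - x)

def escape_time(x, a=3.0, t_max=1000):
    # Equation (12): first passage time of the trajectory out of Gamma = [0, 1].
    t = 0
    while 0.0 <= x <= 1.0 and t < t_max:
        x = tent(x, a)
        t += 1
    return t

# Equation (13) with U uniform on Gamma: histogram of t_e over random states.
random.seed(0)
counts = Counter(escape_time(random.random()) for _ in range(100_000))
```

The relative counts decay by a factor 2/3 per step, reproducing a constant escape rate κ = ln(3/2) ≈ 0.405; the number of samples reaching large t_e therefore falls exponentially, which is the computational bottleneck discussed below.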
There are two main numerical techniques to study transiently chaotic systems. The first is to find one long-living state x with t_e(x) ≫ 1 and to compute an average over states of this trajectory [10,11]. For large t_e(x), the trajectory between times [t_e(x)/2 − t_s, t_e(x)/2 + t_s], with t_s ≪ t_e(x)/2, is close to a trajectory on the chaotic saddle. When the saddle is ergodic, an average over this long trajectory corresponds to an average over the natural invariant density, and therefore an average over these states characterizes invariant properties of the system. The second technique, on which we focus in this work, is to compute averages over an ensemble U(x) of initial conditions x that leave the system at time t_e(x) = t_e [10,44]. For small t_e, observations depend on the particular initial density U(x) (a point of interest in itself [44]), while for large t_e they characterize invariant properties of the system (as in the previous technique, the states F^{t_e/2}(x) with t_e → ∞ are independent samples of the natural invariant density).
A typical analysis in sampling transiently chaotic systems is to measure how a quantity, W(x), changes with increasing escape time t_e. Numerically, this can be written as

W(t_e) = ∫_Γ W(x) U(x) δ_{t_e, t_e(x)} dx.   (14)

Let us enumerate three examples of computational problems that can be interpreted as numerical estimations of an integral of the form of W(t_e) in equation (14).

Compute the escape time distribution
Numerically, P (t e ) is often computed by drawing states from U (x) (e.g. uniform density), and counting the relative number of states that exited at escape time t e [10]. This corresponds to W (x) = 1 in which case equation (14) reduces to equation (13).

Compute generalized dimensions of the chaotic saddle
The generalized dimensions are an important property of the chaotic saddle and their calculation is often performed by box counting [4,10,46]. Essentially, the phase space is divided into equal, non-overlapping, space-filling boxes i = 1, ..., B(ε) of linear size ε (intervals in 1 dimension, squares in 2, etc.), and the generalized dimension D_q of exponent q is proportional to log Σ_{i=1}^{B(ε)} μ_i^q, where μ_i is the fraction of points of the saddle that belong to box i [46]. Numerically, μ_i is estimated by first obtaining a set of points x_j on the saddle and then counting how many are in the particular box i. Such an estimate can be written as the expectation of an indicator function that tells whether a state on the saddle, F^{t_e/2}(x) for t_e(x) ≫ 0, is inside box i, W(x) = δ_{F^{t_e/2}(x) ∈ i}. This expectation can be written as a conditional expectation of W(x) over states that leave at time t_e,

μ_i(t_e) = W(t_e)/P(t_e) = (1/P(t_e)) ∫_Γ W(x) U(x) δ_{t_e, t_e(x)} dx.   (15)

Computing this essentially requires computing integrals of the form of W(t_e) in equation (14).
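As an illustration of the box-counting estimate (here for q = 0), consider the middle-third Cantor set, the invariant set of an open tent map with a = 3, whose box-counting dimension is ln 2/ln 3 ≈ 0.631. Sampling the set directly by random ternary digits is our shortcut, standing in for the states F^{t_e/2}(x) of long-living trajectories:

```python
import math
import random

def cantor_point(depth=25):
    # Random point of the middle-third Cantor set (invariant set of the open
    # tent map with a = 3): ternary expansion with digits drawn from {0, 2}.
    return sum(random.choice((0, 2)) * 3.0 ** -(k + 1) for k in range(depth))

def d0_estimate(points, eps):
    # Box counting: D_0 ~ log B(eps) / log(1/eps), with B(eps) the number of
    # occupied boxes of linear size eps (the q = 0 member of the D_q family).
    occupied = {int(p / eps) for p in points}
    return math.log(len(occupied)) / math.log(1.0 / eps)

random.seed(1)
pts = [cantor_point() for _ in range(50_000)]
d0 = d0_estimate(pts, 3.0 ** -7)
```

At level-7 resolution the set occupies 2^7 = 128 of the 3^7 boxes, so d0 ≈ 7 ln 2/(7 ln 3) = ln 2/ln 3. For q ≠ 0 one replaces the box count by Σ μ_i^q, with μ_i estimated from the fraction of sampled points per box, as in equation (15).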

Compute the distribution of FTLE on the chaotic saddle
The distribution of the FTLE, P(λ), is another important property of the chaotic saddle [10,46]. The FTLE λ = λ_t for a fixed t of an open system is computed for trajectories on the chaotic saddle. As in the previous problem, each of these trajectories can be obtained by generating a state x according to U(x) (e.g. uniformly distributed in Γ) that has a large escape time, t_e(x) ≫ 1, and computing λ_t(F^{t_e/2}(x)) using equation (6) for a fixed t. The distribution of the FTLE is then computed from an ensemble of these high-escape-time states by constructing the different bins of the histogram, I_λ = [λ, λ + Δλ], and numerically computing the relative number of states in each bin. Formally, this equates to computing the expected number of states with a given escape time t_e whose FTLE is in a bin, and thus corresponds to computing a conditional expectation of the form of equation (15) with W(x) = δ_{λ_t(F^{t_e/2}(x)) ∈ I_λ}.

Summary: numerical problems in the study of rare events in chaotic systems
To provide a unified treatment of the numerical problems in transient chaos and in computing the FTLE of closed systems, we use a common notation whenever possible. It also indicates how the methods can be generalized to different problems. Firstly, there is a quantity that we call the "observable" and denote by E(x): E(x) ≡ t_o λ_{t_o}(x) in computations of the FTLE in closed systems and E(x) ≡ t_e(x) in strongly chaotic open systems.
Secondly, we use δ(E, E′) to denote both the Kronecker delta and the Dirac delta (δ(E − E′)), for the discrete and continuous cases respectively. This is needed because, even though in practice we will always consider E to be a discrete function due to the binning of histograms, formally the observable E is either discrete (t_e for maps) or continuous (λ_t, or t_e for flows). Thirdly, the observable E is a function of the phase space of the system but typically depends also on an external quantity N that parameterizes how high E can be, E(x) = E(x; N). This parameter is important to us because, the higher it is, the rarer a state with the maximum or minimum possible E is. In this notation, the two general problems presented in the previous sections can be summarized as follows: the analysis is conditioned on a projection of x onto a one-dimensional observable E(x); the probability distribution of E, P(E), decays exponentially with increasing E; the focus of the analysis is on rare states with respect to P(E), i.e. states with E in one of the tails of P(E); and the rareness increases with N.
In transient chaos problems we are interested in states with increasing N = t_max that correspond to the tail of P(E) = P(t_e), i.e. to P(t_max). In computations of the FTLE in closed systems we are interested in states with increasing N = t_o and on the tails of P(E). These problems share two distinct computational tasks: find rare states [10-12,16] and sample rare states [18,19,24,25], which can be formalized as follows:
- Find rare states: minimize or maximize E(x) over a constraining region Γ, x ∈ Γ ⊂ Ω, for increasing N;
- Sample rare states: compute the integral of a function W(x) conditioned on a particular value of E and for increasing N, over an ensemble of states x distributed according to U(x) in a constraining region of the phase space, x ∈ Γ ⊂ Ω:

W(E) = ∫_Γ W(x) U(x) δ(E, E(x)) dx.   (16)

The numerical challenges of these two problems are common: the states x are exponentially difficult to find with increasing N (or E), the function E(x) contains multiple local and global minima embedded in fractal-like structures, and the dimensionality D of the phase space can be high (see Figs. 1 and 2). The goal of this paper is to develop a systematic approach to tackle these two numerical problems and these three challenges.

Importance sampling
As described in the previous section, the traditional methodology to compute an integral of the form of equation (16) is to draw m samples {x_i} distributed according to U(x) from the relevant region Γ and approximate the integral W(E) by the estimator

Ŵ(E) = (1/m(E)) Σ_{i=1}^{m} W(x_i) δ(E, E(x_i)),   (17)

where m(E) ≡ Σ_{i=1}^{m} δ(E, E(x_i)) is the number of samples with observable E. The relative error ε(E) of this estimator scales as 1/√m(E). For the estimator in equation (17), this is given by

ε(E) ∝ 1/√(m G(E)),   (18)

where G(E) is the density of states: the number of states per bin with an observable E (G(E) = P(E) when each state is equally weighted, U(x) = 1/|Γ|). Therefore, the number of samples m* required to achieve a given precision ε* for a given E* is m*(E*) ∝ 1/G(E*). The critical problem in sampling rare states is that, because G(E) decays exponentially with E, m*(E) increases exponentially with E.
Importance sampling techniques aim to improve this scaling by drawing samples from a distribution π(x) ≠ U(x) on the phase space (i.e. non-uniformly in the phase space) [7]. Specifically, consider m independent samples {x_i} drawn from π(x), and take the function π(x) to depend only on E, π(x) = π(E(x)) = π(E). Because the samples are not drawn from U(x), the estimator in equation (17) would be biased. Importance sampling uses an unbiased estimator for W(E) given by [7]

Ŵ(E) = [Σ_i (U(x_i)/π(x_i)) W(x_i) δ(E, E(x_i))] / [Σ_i (U(x_i)/π(x_i)) δ(E, E(x_i))],   (19)

which reduces to equation (17) when π(x) = U(x). The advantage of importance sampling is that the relative error of the estimator in equation (19) is given by

ε(E) ∝ 1/√m(E).   (20)

This is because, when sampling from π(x) = π(E(x)), the expected number of samples with a given E, m(E), is equal to

m(E) = m π(E) G(E).   (21)

Equation (20) implies that the function π(E) can be chosen to favour states x with observable E on the tails of G(E) and therefore improve the precision of the estimator on these tails. The standard deviations in equations (18) and (20) were obtained assuming that the m samples are independent. In traditional methodologies such as uniform sampling this is the case; however, in the algorithms discussed below it is not. It is therefore necessary to modify equation (20) for the case where the samples {x_i} are drawn from π(x) and are also correlated. This modification is given by

ε(E) ∝ √(T(E)/m(E)),   (22)

where T(E) is the autocorrelation time [7], which increases with the correlation of the samples. Efficient importance-sampling [6-8] techniques have to address the following three steps:
1. choose a suitable π(x);
2. have a method to generate samples from π(x);
3. minimize the autocorrelation time T.
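The reweighting in equation (19) can be illustrated with a toy example detached from any dynamical system: estimating the average of W(x) = x² under U uniform on [0, 1] (exact value 1/3) from samples drawn instead from the density π(x) = 2x. The self-normalized estimator with weights U/π recovers the answer:

```python
import math
import random

random.seed(2)

# Samples from pi(x) = 2x on (0, 1]: the square root of a uniform variate.
# (1 - random() lies in (0, 1], avoiding a division by zero below.)
m = 200_000
xs = [math.sqrt(1.0 - random.random()) for _ in range(m)]

# Self-normalized reweighting as in equation (19): weights U(x)/pi(x) = 1/(2x).
weights = [1.0 / (2.0 * x) for x in xs]
estimate = sum(w * x * x for w, x in zip(weights, xs)) / sum(weights)
```

The estimate converges to 1/3 even though no sample was drawn from U. In the algorithms below, π plays the same role for the observable E, concentrating samples where the density of states G(E) is exponentially small.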
A defining point of our method is the choice of the Metropolis-Hastings algorithm to address point 2, differently from references [16,20], which address similar problems through a different choice for point 2 (cloning trajectories). Below we first discuss point 2, then point 1, and finally point 3.

Metropolis-Hastings algorithm
The Metropolis-Hastings (MH) algorithm asymptotically generates states x according to π(x) using a Markovian, ergodic random walk that fulfills detailed balance in the sampling region Γ [7]. This random walk is initialized from a random state x ∈ Γ and evolves to a new state x′ ∈ Γ with a transition probability P(x′|x) chosen such that asymptotically the states x are drawn according to π(x). In Metropolis-Hastings, P(x′|x) is written as

P(x′|x) = g(x′|x) a(x′|x),

where g(x′|x) is the (proposal) distribution used to generate new states and a(x′|x) is the (acceptance) probability used to select them. The random walk fulfills detailed balance because the acceptance probability is chosen as [7]

a(x′|x) = min [1, (π(x′) g(x|x′)) / (π(x) g(x′|x))].   (23)

The Metropolis-Hastings algorithm is not the only way to achieve this. Another popular alternative to sample from π(x) is population Monte Carlo, which, instead of a random walk, uses multiple stochastic trajectories that are cloned and destroyed. See, e.g., reference [19] for an application to the problem of the FTLE described above.
Algorithmically, the MH algorithm is implemented as follows: choose a region Γ and a random initial condition x ∈ Γ, and evolve the random walk in time according to:
1. propose a state x′ drawn from g(x′|x);
2. compute a(x′|x) by replacing x and x′ in equation (23);
3. generate a random number r in [0, 1]; if r < a(x′|x), make x′ the new x;
4. store x and go to 1.
The set of sub-steps 1-4 brings the random walk from its current state x to the next state and is called a Markov step. After a transient number of steps in which the algorithm converges to the asymptotic distribution, the stored states x are (correlated) samples drawn from π(x) and can be directly used in equation (19).
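A minimal sketch of steps 1-4, assuming the open tent map with a = 3 (our illustrative parameter choice) and a canonical target π(x) ∝ exp(+β t_e(x)) favouring long-lived initial conditions; the proposal is uniform on Γ = [0, 1], so g cancels in equation (23). Such a naive proposal is viable only for mild β, which is precisely the limitation discussed below:

```python
import math
import random

random.seed(3)

def tent(x, a=3.0):
    # Open tent map: F(x) = a*x for x < 1/2, a*(1-x) otherwise.
    return a * x if x < 0.5 else a * (1.0 - x)

def escape_time(x, a=3.0, t_max=50):
    # First passage time out of Gamma = [0, 1].
    t = 0
    while 0.0 <= x <= 1.0 and t < t_max:
        x = tent(x, a)
        t += 1
    return t

def mh_escape(n_steps, beta=0.2):
    # Steps 1-4 with a uniform (independence) proposal g(x'|x) = 1 on [0, 1],
    # so equation (23) reduces to a = min(1, pi(x')/pi(x)) with
    # pi(x) ~ exp(+beta * t_e(x)).
    x = random.random()
    e = escape_time(x)
    stored = []
    for _ in range(n_steps):
        xp = random.random()                          # 1. propose x'
        ep = escape_time(xp)
        a_prob = min(1.0, math.exp(beta * (ep - e)))  # 2. acceptance a(x'|x)
        if random.random() < a_prob:                  # 3. accept or reject
            x, e = xp, ep
        stored.append(e)                              # 4. store the current state
    return stored

samples = mh_escape(20_000)
```

Under uniform sampling the mean escape time is 3 for a = 3; the chain above samples π(t_e) ∝ e^{β t_e}(2/3)^{t_e}, whose mean is ≈ 5.4 for β = 0.2. For larger β the weight moves to escape times that a uniform proposal essentially never hits, and the acceptance collapses.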

Canonical ensemble
A sampling distribution π(x) often used is the canonical distribution [6]

π(x) ∝ e^{−βE(x)}.   (24)

In the context of rare states, the canonical ensemble is useful because the number of sampled states m(E) in equation (21) becomes

m(E) = m π(E) G(E) ∝ e^{S(E) − βE},   (25)

where S(E) ≡ ln G(E). In particular, the maximum of m(E) is at the solution E* of β = dS/dE(E*). Therefore, β tunes which value of the observable E is sampled the most. For example, the Lyapunov weighted dynamics in reference [19] uses this distribution (Eq. (15) of that reference, with α replaced by β).

Flat-histogram ensemble
Another distribution often used in the literature on Metropolis-Hastings [47-54] is the flat-histogram (or multicanonical) distribution, given by [52,53]

π(x) ∝ 1/G(E(x)) for E(x) ∈ [E_min, E_max],   (26)

for a given choice of E_min, E_max that defines the region of interest of the observable E. This is known as flat-histogram because replacing equation (26) in equation (21) leads to a constant average number of samples at each E,

m(E) = m π(E) G(E) = const.   (27)

Consequently, the dependence of the variance in equation (22) on E is only due to the autocorrelation time T(E), which implies that the computational cost to draw a state on the tail of G(E) is no longer driven by the exponential decrease of G(E), but by the computational cost to draw uncorrelated samples from π(x).
The main limitation of the flat-histogram ensemble is that it requires knowing G(E) in advance, which is very often not the case. Several methods exist to estimate G(E) on the fly; the most well known, which we use here, is the Wang-Landau algorithm [53]. It modifies the Metropolis-Hastings algorithm into a non-Markovian chain that asymptotically converges to a flat-histogram Metropolis-Hastings. The Wang-Landau algorithm starts with an approximation of G(E), G_WL(E) = 1/|E_max − E_min|, and, on step 4 of the MH algorithm (see Sect. 3.2), it multiplies G_WL(E(x)) (x is the current state of the random walk) by a constant f > 1. After a given number of steps, ln f is reduced by a factor of 2, and this procedure is repeated until a final f_min ≈ 1 is reached [53]. The value of f_min and how f is reduced dictate how close G_WL(E) will be to G(E) [53,55,56]. An alternative approach to sample states over a broad histogram is replica exchange; see references [17,18,22] for applications to chaotic systems.
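A compact sketch of the Wang-Landau modification for the escape-time observable of an open tent map (a = 3 is again our illustrative choice). The schedule below reduces ln f after a fixed number of steps rather than after a histogram-flatness check, a simplification of the original algorithm:

```python
import math
import random

random.seed(4)

def tent(x, a=3.0):
    return a * x if x < 0.5 else a * (1.0 - x)

def escape_time(x, a=3.0, t_max=12):
    t = 0
    while 0.0 <= x <= 1.0 and t < t_max:
        x = tent(x, a)
        t += 1
    return t

def wang_landau(e_max=12, n_stages=30, steps_per_stage=5000):
    lng = [0.0] * (e_max + 1)   # running estimate of ln G_WL(t_e), initially flat
    lnf = 1.0                   # ln f, the modification factor
    x = random.random()
    e = escape_time(x)
    for _ in range(n_stages):
        for _ in range(steps_per_stage):
            xp = random.random()    # uniform proposal, adequate for this map
            ep = escape_time(xp)
            # flat-histogram acceptance: min(1, G_WL(E)/G_WL(E'))
            if random.random() < math.exp(min(0.0, lng[e] - lng[ep])):
                x, e = xp, ep
            lng[e] += lnf           # multiply G_WL(E(x)) by f
        lnf /= 2.0                  # reduce f; repeat until f_min ~ 1
    return lng

lng = wang_landau()
```

For a = 3 the exact density of states satisfies ln G(t+1) − ln G(t) = ln(2/3) ≈ −0.405, which the estimate reproduces up to an irrelevant additive constant (the last bin aggregates all t_e ≥ 12 and is biased upward, so the slope is best read off away from it).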

Characterisation of the efficiency
The relative error ε(E) depends on the value of N and on the particular value of E/N. To avoid discussing the dependency on E/N, the efficiency of the flat-histogram is often quantified in terms of the average round-trip [57-59], which is an upper bound for the number of Markov steps m (samples) required to obtain an uncorrelated sample from π(x) [59,60]. The round-trip, τ, is the average number of steps required for the algorithm to go from a state x with E(x) = E_min to a state x′ with E(x′) = E_max and return. Reference [24] shows how the autocorrelation time of a canonical ensemble is related to the autocorrelation time of a flat-histogram, and therefore we will use the round-trip time of a flat-histogram simulation to quantify the efficiency of both distributions. Uniform sampling has a round-trip that increases exponentially with N: on average it takes 1/G(E_max) samples to get one sample with E_max, and 1/G(E_max) increases exponentially with N.
Importance-sampling Monte Carlo is widely used in statistical physics because the computational cost often scales polynomially with N [7,59], a dramatic improvement over uniform sampling. Under the hypotheses that (a) the acceptance rate does not vanish with increasing N and (b) the correlations between the different E(x) of the random walk decay fast enough, it can be shown that the round-trip τ scales as [25,59]

τ ∝ N².   (28)

For example, consider the problem of sampling states of an open chaotic system with different escape times t_e, from t_e = 1 up to t_e = t_e^max. In this case, E_min = 1 and E_max = N = t_e^max. Thus, under hypotheses (a) and (b) above, the round-trip is expected to scale as

τ ∝ (t_e^max)².   (29)

There are known deviations from the scaling in equation (28) leading to a higher exponent, N^{2+z} with z > 0 [25,57-59], and there are two common situations where this happens: (1) the acceptance rate decreases drastically with N (hypothesis (a) above is violated); (2) the autocorrelation drastically increases with N (hypothesis (b) above is violated) [25,59]. Nevertheless, these do not qualitatively change the argument: Monte Carlo with flat-histogram has the potential to generate rare states at a cost growing polynomially with N, while for uniform sampling the cost grows exponentially with N.
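The N² scaling of equation (28) reflects the fact that an ideal flat-histogram chain performs an unbiased diffusion in E between E_min and E_max. A toy check with a simple random walk on {1, ..., N}: the round-trip time should roughly quadruple when N doubles:

```python
import random

random.seed(5)

def round_trip(n):
    # Idealized flat-histogram dynamics: an unbiased random walk on the
    # observable values {1, ..., n}; count steps for E_min -> E_max -> E_min.
    pos, steps, going_up = 1, 0, True
    while True:
        steps += 1
        pos += random.choice((-1, 1))
        pos = max(1, min(n, pos))   # reflect at the boundaries
        if going_up and pos == n:
            going_up = False
        elif not going_up and pos == 1:
            return steps

avg_16 = sum(round_trip(16) for _ in range(2000)) / 2000
avg_32 = sum(round_trip(32) for _ in range(2000)) / 2000
```

The ratio avg_32/avg_16 comes out close to 4, the diffusive τ ∝ N² scaling; a violation of hypotheses (a) or (b) would show up as a systematically larger ratio.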

Summary: challenges for the application of Monte Carlo methods to chaotic systems
To achieve a MH algorithm that scales polynomially with N, the autocorrelation time needs to be low. A key ingredient for an efficient algorithm is a good proposal distribution g(x'|x) [61]. The ideal proposal distribution of a Metropolis-Hastings algorithm draws x' independently of x according to π(x'), g(x'|x) = π(x'). This is because (i) the acceptance in equation (23) is always one and (ii) each step of the random walk generates an independent sample x, which implies that the error in equation (22) is minimal. The difficulty of sampling rare events in chaotic systems is that a π(x) useful to sample them is a difficult function to sample from. For concreteness, consider the problem of sampling high escape times in the open tent map defined in Appendix A.3 (equation (A.5)), whose t_e(x) is represented in Figure A.1, and consider the canonical sampling distribution π(x) ∝ exp(−βt_e(x)). In this example, π(x) ∝ exp(−β·1) in [1/a, 1 − 1/a], and so forth (∝ exp(−βt_e)) in the subsequent intervals. The number of intervals increases as 2^{t_e}. Therefore, sampling directly from π(x) would require enumerating every interval, sampling one at random according to a distribution that depends on β, and then sampling a uniform point within that interval. While this can be done in simple maps such as the open tent map, it is unfeasible in a general system where the t_e(x) dependency is unknown. One alternative is to consider g(x'|x) to be uniformly distributed over Γ. One could imagine that changing the sampling distribution π(x) would then decrease the variance of the estimator, because it gives more preference to rarer states. However, this is not the case: changing the sampling distribution alone does not decrease the scaling of the variance of the estimator. This is because changing the sampling distribution (e.g. using a canonical ensemble) leads to an exponential increase of the autocorrelation time T(E) with E (see Appendix B), making the method as efficient as traditional uniform sampling. While here we focus on MH, the choice of the proposed state is crucial to Monte Carlo methods in general, and thus addressing this problem for MH can be useful beyond it, e.g. for the methods used in references [17,18,22].
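The interval structure described above can be made explicit. Assuming for illustration a symmetric open tent map with a single slope a = 3 (not identical to the two-slope open tent map of Appendix A.3), the set of states with t_e > t is the union of 2^t intervals of total measure (2/a)^t, obtained by iterated preimages; direct sampling from π(x) ∝ exp(−βt_e(x)) would require enumerating all of them:

```python
A = 3.0  # slope of the illustrative open tent map

def preimages(lo, hi):
    """The two preimages of [lo, hi] under the tent map:
    one through the branch x -> A*x, one through x -> A*(1 - x)."""
    return [(lo / A, hi / A), (1.0 - hi / A, 1.0 - lo / A)]

def survival_intervals(t):
    """Intervals of initial conditions that survive at least t iterations."""
    intervals = [(0.0, 1.0)]
    for _ in range(t):
        intervals = [iv for seg in intervals for iv in preimages(*seg)]
    return intervals

for t in range(1, 6):
    ivs = survival_intervals(t)
    measure = sum(hi - lo for lo, hi in ivs)
    print(t, len(ivs), measure)  # 2**t intervals of total measure (2/A)**t
```

The number of intervals doubles with each additional unit of escape time, which is precisely why direct enumeration is unfeasible for large t_e.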
In summary, Metropolis-Hastings is an excellent candidate to address the numerical challenges found in the study of rare events in chaotic systems. Firstly, it is grounded in strong mathematical results, such as the importance-sampling theorem and the asymptotic convergence of Markov processes. Secondly, it is formulated with very few assumptions about the system, the observable of interest, or the dynamics, which gives enough freedom to adapt it to the specific aim (sampling or finding), observable, and system. Thirdly, there seems to be no theoretical reason for the sampling to be exponential: Metropolis-Hastings is used to sample rare states in polynomial time in other problems of statistical physics. Finally, the numerical problems found in chaotic systems can be rewritten as problems for which Metropolis-Hastings is suitable. On the other hand, the optimal proposal of Metropolis-Hastings is unfeasible in chaotic systems, and without extra information about the system Metropolis-Hastings is only as efficient as traditional uniform sampling.

Proposal distribution
The problem we address in this section is how to incorporate general properties of chaotic systems into the proposal distribution in such a way that the Metropolis-Hastings algorithm becomes efficient. We first set an aim for the proposal distribution and then show how this aim can be achieved in the different problems involving deterministic chaotic systems.

Aim of the proposal distribution
The goal of the proposal distribution g(x'|x) we construct here is to bound the acceptance rate in equation (23) away from 0 and 1. Since a(x'|x) in equation (23) depends on x', it is essential to look at its expectation over the proposal,

E[a(x'|x)] = ∫_Γ g(x'|x) a(x'|x) dx'.

Our goal is to construct a proposal distribution such that

E[a(x'|x)] = a*, with 0 < a* < 1 independent of N.   (31)

The motivation to set this as the starting point (and cornerstone) of our method is that it avoids the two typical origins of high correlation times T. When the acceptance is low, T increases because the method remains stuck in the same state x for long times. High acceptance typically indicates that x' is too close to x (often E(x') = E(x)), which implies that the simulation moves too slowly (in E and in Ω). The goal here is not that the acceptance is exactly a*, but rather that it remains bounded away from the extremes and does not strongly depend on N.
In general it is non-trivial to construct a proposal distribution that guarantees a constant acceptance. Thus, the next step is to approximate equation (31) by a simpler condition. Let us first notice that π(x) only depends on E(x). Therefore, at the very least, the proposal g(x'|x) should guarantee that x' is generated from x in such a way that π(E_{x'}) is neither too close to (high acceptance) nor too far from (low acceptance) π(E_x). Quantitatively, this can be written as

π(E_{x'}) / π(E_x) ≈ a*,   (32)

where 0 < a* < 1 is a constant. When the proposal is able to achieve a small variation of E, π(E_{x'}) can be expanded in a Taylor series around E_{x'} = E_x, which allows writing

log π(E_{x'}) ≈ log π(E_x) + (E_{x'} − E_x) d log π(E)/dE|_{E=E_x}.   (33)

Inserting equation (33) in equation (32), an average constant acceptance is thus achieved when

|E[E_{x'} − E_x | x]| ≈ 1 / |d log π(E)/dE|_{E=E_x}.   (34)

This equation, the main result of this section, is a condition that an efficient Metropolis-Hastings algorithm imposes on the proposal distribution in terms of the average difference in the observable E. This condition is non-trivial because it depends on the particular π, E, F, and x. For the sampling distributions discussed in the previous section, we obtain:

Canonical ensemble
When π(x) ∝ e^{−βE_x}, the condition in equation (34) becomes

|E[E_{x'} − E_x | x]| ≈ 1/β.   (35)

That is, the higher the β, the closer the proposed E_{x'} has to be to E_x.

Flat-histogram
When π(x) ∝ 1/G(E_x), the condition in equation (34) becomes

|E[E_{x'} − E_x | x]| ≈ 1 / |d log G(E)/dE|_{E=E_x}.   (36)

When E_x is close to the maximum of G(E), the derivative of log G approaches 0 and E_{x'} can be arbitrarily distant from E_x. As E_x deviates from the maximum of G(E), smaller changes are necessary to achieve a constant acceptance. Note that equation (34) is valid for the Metropolis-Hastings algorithm in general and should be of interest in other contexts as well (e.g., an arbitrarily large number of spins can be flipped close to the maximum of the density of states).

Propose correlated trajectories
The proposal distribution must correlate the trajectory starting at x' with the trajectory starting at x in such a way that equation (34) holds. Fulfilling this requirement demands the ability to control E[E_{x'} − E_x | x], which in turn requires a procedure to correlate the state x' with x. The aim of this section is to introduce a quantification of the correlation of two trajectories of finite time t_o that can be related to E[E_{x'} − E_x | x], and to present two different proposal distributions that propose a state x' for which this correlation is controlled.
The observables E introduced in Section 2.3 all depend not only on x but also on the full trajectory of length t_o starting at x. One natural way to quantify the similarity of two trajectories is the length over which the two trajectories remain within a distance Δ much smaller than a characteristic length of Γ. Formally, this can be quantified as

t_*(x, x') = max{ t : |F^{t_shift + i}(x) − F^{i}(x')| < Δ for all 0 ≤ i ≤ t }.   (37)

For t_shift = 0 this recovers the case of two trajectories starting close to each other; for t_shift ≠ 0 it includes situations where a trajectory starts at x' close to F^{t_shift}(x). This definition is motivated by the concept of symbolic sequences [4]. The similarity of the trajectory starting from x to the one starting from x' can be quantified by the number of symbols that the two trajectories share, which corresponds to t_*(x, x') in equation (37). The definition in equation (37) avoids the need for a phase-space partition to exist but, for the purposes of the argument below, the two trajectories share a sequence of t_* states that are within Δ of each other. The average correlation between two states, one of which is drawn according to a proposal distribution, is here defined by

t_*(x) ≡ E[t_*(x, x')|x] = ∫_Γ g(x'|x) t_*(x, x') dx'.   (38)

Notice that this quantity does not depend on the particular sampling distribution or algorithm; it is a function of the proposal distribution and the state x. The goal of the next subsections is to construct proposal distributions that guarantee a given average correlation t_*(x). We will show, for example, that we can enforce an average correlation t_*(x) if we use a normal distribution centered around x with a specific standard deviation as our proposal distribution.
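The similarity t_*(x, x') of equation (37) (with t_shift = 0) is straightforward to compute numerically. A minimal sketch for an illustrative skewed tent map; the parameter w = 0.3 and threshold Δ = 0.05 are assumptions made for the example:

```python
W = 0.3  # skewness of the illustrative skewed tent map

def skewed_tent(x):
    return x / W if x < W else (1.0 - x) / (1.0 - W)

def t_star(x, x2, t_o, delta=0.05):
    """Similarity of the trajectories from x and x2 (Eq. (37) with t_shift = 0):
    the number of initial steps during which the two trajectories stay within delta."""
    t = 0
    while t < t_o and abs(x - x2) < delta:
        x, x2 = skewed_tent(x), skewed_tent(x2)
        t += 1
    return t

# nearby states stay similar for about log(delta/|x' - x|)/lambda steps before
# the exponential divergence separates them; distant states have similarity 0
print(t_star(0.2, 0.2 + 1e-6, 40))
print(t_star(0.2, 0.6, 40))
```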

Shift proposal
One proposal that guarantees that trajectories are correlated by t_* is the shift proposal, originally introduced in reference [13] in the context of sampling paths of chemical reactions and also used in reference [23]. It consists of proposing a state x' that is a forward or backward iteration of x, x' = F^{t_shift}(x), where t_shift is a free parameter. The relation between t_shift = t_shift(x) and t_* = t_*(x) is that a shift of ±t_shift guarantees that t_o − t_shift elements of the original trajectory are preserved. Therefore, this proposal guarantees that t_* elements are preserved when |t_shift| = t_o − t_*, see the right panel of Figure 3. Because detailed balance has to be enforceable, the proposal must contain both backward and forward shifts. A proposal that automatically fulfills detailed balance is one in which backward and forward shifts are equally likely, g(F^{t_shift}(x)|x) = g(F^{−t_shift}(x)|x) = 1/2. Given the target average correlation t_*(x), this proposal can be implemented as follows: generate a random number r ∈ [0, 1]; if r < 0.5, propose x' = F^{t_shift}(x), otherwise propose x' = F^{−t_shift}(x), with t_shift = t_o − t_*(x). This proposal unfortunately has some disadvantages: (i) a priori there is no guarantee that F^{t_shift}(x) ∈ Γ; it is applicable when Γ = Ω, which, e.g., is not the case in open systems; (ii) it requires the map to be invertible; (iii) it can only propose states that are forward or backward iterations of x, so for the random walk to be ergodic in the phase space the map itself must be ergodic; (iv) it diffuses without drift along the trajectory passing through x, by shifting the starting point forward or backward, and will therefore always sample fewer states than a time average of a trajectory starting at x. On the other hand, the main advantage of this proposal is that it performs non-local jumps in the phase space: it allows jumping from one region of the phase space to another while still keeping x' correlated with x.
As shown below, in combination with another proposal, the shift proposal is useful to reduce correlations stemming from local jumps.
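A minimal sketch of the shift proposal for an invertible system; the Chirikov standard map is used here because, as noted above, the proposal requires invertibility (which the tent map lacks), and the value K = 6 is an illustrative choice:

```python
import math
import random

K = 6.0  # kicking strength of the standard map (illustrative value)

def F(state):
    """Chirikov standard map (invertible, so backward shifts are available)."""
    x, p = state
    p2 = p + K * math.sin(x)
    return ((x + p2) % (2.0 * math.pi), p2)

def F_inv(state):
    """Inverse of the standard map."""
    x2, p2 = state
    x = (x2 - p2) % (2.0 * math.pi)
    return (x, p2 - K * math.sin(x))

def shift_proposal(state, t_shift, rng=random):
    """Propose F^{+t_shift}(x) or F^{-t_shift}(x) with probability 1/2 each;
    equal forward/backward probabilities fulfill detailed balance."""
    step = F if rng.random() < 0.5 else F_inv
    for _ in range(t_shift):
        state = step(state)
    return state

random.seed(0)
print(shift_proposal((1.0, 0.5), t_shift=3))
```

The proposed state lies on the same orbit as x, so the two trajectories share t_o − t_shift elements, which is the non-local jump exploited later for the mixed proposal.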

Neighbourhood proposal
Another strategy to construct a proposal for which, on average, the states are correlated by t_*(x) is to perturb x by a finite amount δ, x' = x + δ, characterized by a direction δ̂ and a norm δ, δ = δ̂δ. A common case is when the probability distribution is separated into two independent terms [7],

g(x'|x) = P(δ̂|x) P(δ|x),   (41)

with P(δ̂|x) uniformly distributed over the D directions and P(δ|x) with zero mean (i.e. an isotropic proposal). Here we restrict the analysis to this situation, and we also assume that P(δ|x) is characterized by a well-defined scale, e.g. that it is a half-normal distribution with mean δ_x(x) (equation (42)); an exponential distribution would be equally acceptable and would not change the main conclusions. This choice fixes the ratio g(x|x')/g(x'|x) that enters equation (23). The main motivation for this choice is that the proposal distribution is described by a single function, δ_x(x), which quantifies the distance |x' − x|. The goal is now to relate δ_x(x) to t_*(x). Let us start by describing two important limits: in the limit δ_x(x) → 0 the states are the same, and therefore lim_{δ_x(x)→0} t_*(x) = t_o; in the limit δ_x(x) → |Γ| the proposal is approximately equivalent to drawing x' uniformly from Γ, so x' is independent of x and t_*(x) = 0. To preserve a correlation of t_*(x), δ_x(x) must be such that the two trajectories starting at x and x' stay close together up to a time t_*(x), see the left panel of Figure 3. Because the system is chaotic, for small δ_x(x) two trajectories diverge exponentially in time according to equation (4), and, in particular, their maximal distance is given by equation (7). Therefore, to guarantee that the two trajectories are separated by at most Δ after a time t_*(x), δ_x(x) must be given by

δ_x(x) = Δ exp(−λ(x) t_*(x)).   (43)

This equation relates the parameter of the proposal distribution, δ_x(x), to the average correlation t_*(x) of the two states x and x'. The neighborhood proposal derived above is closely related to a proposal described in reference [23] as "precision shooting".
Precision shooting constructs a trajectory {x'_i} with x'_0 ≡ x' = x + δ̂δ (where δ is a free parameter) that, within the numerical precision of a computer, is indistinguishable (in the system considered) from a trajectory starting at x with δ small. The trajectory {x'_i} shadows a true trajectory starting at x', in the same spirit as the algorithm used in reference [64] to construct a pseudo-trajectory. Thus, precision shooting can be interpreted as the neighborhood proposal, equation (43), with t_* a free parameter (related to δ via Eq. (43)) that assumes the shadowing theorem to simplify the construction of x'. Reference [23] discusses how the acceptance rate depends on δ, suggesting that the acceptance rate increases with decreasing δ (Fig. 9 of the reference). In light of the discussion in Section 4.2, this result is interpreted as follows: as δ decreases, x' becomes more correlated with x (since t_* is related to δ by Eq. (43)), and therefore the acceptance is expected to increase, as indicated in Figure 9 of reference [23]. This discussion is insufficient for our purposes because it does not allow deriving the δ (or t_*) that fulfills the condition in equation (34). The crucial advantage of equation (43) is that it allows relating E[E(x') − E(x)|x] (in Eq. (34)) to |x' − x| (in Eq. (43)) through t_* (in Eq. (37)). This is the goal of the next section.
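A minimal sketch of the neighborhood proposal in one dimension, assuming an Eq. (43)-type scale δ_x = Δ exp(−λ t_*) and a half-normal step length; the values of Δ and λ are illustrative assumptions, not the paper's settings:

```python
import math
import random

DELTA = 0.05   # similarity threshold (illustrative)
LAM = 0.61     # Lyapunov exponent of the assumed system (illustrative value)

def neighborhood_proposal(x, t_star, rng=random):
    """Propose x' = x + direction * |delta| with |delta| half-normal with mean
    delta_x = DELTA * exp(-LAM * t_star), the Eq. (43)-type scale: the
    trajectories from x and x' then stay within DELTA for about t_star steps."""
    delta_x = DELTA * math.exp(-LAM * t_star)
    sigma = delta_x * math.sqrt(math.pi / 2.0)  # half-normal: mean = sigma*sqrt(2/pi)
    step = abs(rng.gauss(0.0, sigma))
    direction = 1.0 if rng.random() < 0.5 else -1.0  # isotropic in one dimension
    return x + direction * step

random.seed(2)
for t in (0, 5, 10):
    print(t, abs(neighborhood_proposal(0.4, t) - 0.4))  # scale shrinks as exp(-LAM*t)
```

The single function δ_x(x) fully specifies the proposal, which is what makes the conditions of the previous subsection enforceable in practice.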

Guarantee local proposals
Now that we have derived proposal distributions that enforce an average correlation t_*(x), the next (and final) step is to obtain a relationship between t_*(x) and E[E(x') − E(x)|x]. Because the computation of t_*(x) depends on the particular observable E, a different derivation is presented for each of the two observables, t_e and λ_{t_o}. Given the limitations of the shift proposal described above, the argumentation below is limited to neighborhood proposals (an equivalent argumentation can be made for the shift proposal).

FTLE in closed systems
As introduced in Section 2.1, the observable in this case is

E(x) = t_o λ_{t_o}(x),   (44)

where, in one dimension, λ_{t_o}(x) = (1/t_o) Σ_{t=0}^{t_o−1} log|F'(x_t)| and x_t ≡ F^t(x). Likewise, for the trajectory starting from x',

E(x') = t_o λ_{t_o}(x').   (45)

Because x' is proposed according to equation (43), by construction the first t_*(x) states of the trajectory starting at x' are close (within Δ) to the states of the trajectory starting at x. Therefore, we can approximate the respective FTLEs up to time t_* as equal,

E[λ_{t_*}(x')|x] = λ_{t_*}(x).   (46)

Subtracting equation (44) from equation (45) and using equation (46) gives

E[E(x') − E(x)|x] = (t_o − t_*) ( E[λ_{t_o−t_*}(x'_{t_*})|x] − λ_{t_o−t_*}(x_{t_*}) ).   (47)

The left side of this equation is the same as in equation (34), and thus the aim now is to write the right side as a function of properties of the system. Let us focus first on the calculation of E[λ_{t_o−t_*}(x'_{t_*})|x]. By construction, x' is generated such that |x'_{t_*} − x_{t_*}| ≈ Δ. Because the system is chaotic, one can approximate that x'_{t_*} is sufficiently separated from x_{t_*} that λ_{t_o−t_*}(x'_{t_*}) is independent of x. Under this approximation, x'_{t_*} is essentially a random state of the phase space, and thus λ_{t_o−t_*}(x'_{t_*}) is a draw from the distribution of FTLEs at time t_o − t_*.
Denoting the mean of this distribution by λ_{L,t_o−t_*}, we get

E[λ_{t_o−t_*}(x'_{t_*})|x] = λ_{L,t_o−t_*}.   (48)

Replacing equation (48) in equation (47) gives

E[E(x') − E(x)|x] = (t_o − t_*) ( λ_{L,t_o−t_*} − λ_{t_o−t_*}(x_{t_*}) ).   (49)

This equation relates the expected change of the observable to properties of the system (λ_{L,t_o−t_*}), of the trajectory x (λ_{t_o−t_*}(x_{t_*})), and t_*(x), and it can thus be used in the condition we obtained earlier, equation (34).
Replacing the left side of equation (34) by the expectation in equation (49) and solving for t_*(x) gives

t_*(x) = t_o − [ |λ_{L,t_o−t_*} − λ_{t_o−t_*}(x_{t_*})| · |d log π(E)/dE| ]^{−1}.   (50)

We can further simplify this relation with two approximations: (a) for large t_o − t_*, the mean FTLE at time t_o − t_*, λ_{L,t_o−t_*}, is approximately the Lyapunov exponent of the system,

λ_{L,t_o−t_*} ≈ λ_L;   (51)

(b) because trajectories of chaotic systems are short-correlated in time, the FTLE of the trajectory x up to t_* is approximately equal to the FTLE of the trajectory starting at x_{t_*},

λ_{t_o−t_*}(x_{t_*}) ≈ λ_{t_o}(x).   (52)

Using equations (51) and (52), t_*(x) in equation (50) can be written as

t_*(x) = t_o − [ |λ_L − λ_{t_o}(x)| · |d log π(E)/dE| ]^{−1}.   (53)

To enforce that t_* ∈ [0, t_o], we use

t_*(x) = min{ t_o, max[ 0, t_o − ( |λ_L − λ_{t_o}(x)| · |d log π(E)/dE| )^{−1} ] },   (54)

which is the main result of this section. This equation provides an expression for t_*(x) that can be inserted into the parameter of the proposal distribution, equation (43), and that, under the approximations used above, achieves a constant acceptance rate.9 The derivation of equation (54) can be generalized to other observables that, like λ_{t_o}, can be written as an average over the trajectory: consider

e_{t_o}(x) = (1/t_o) Σ_{i=0}^{t_o−1} f(x_i),   (55)

where f(x_i) is an arbitrary function of the phase space (choosing f as the logarithm of the derivative of the map corresponds to e_{t_o}(x) = λ_{t_o}(x) and E_{t_o}(x) = λ_{t_o}(x) t_o). Replacing this quantity in the derivation of equation (54), mutatis mutandis and without using the approximation in equation (52), one obtains equation (56), where E_L is approximately the mean of the distribution P(E_{t_o}) (using the approximation in Eq. (48)). This generalizes equation (54) to an arbitrary average e_{t_o}(x) over trajectories of size t_o, and it should be useful to sample rare states with respect to observables corresponding to expected values over trajectories.

9 In reference [25] we used t_* = t_o − 1, different from equation (54). The derivation of equation (54) uses stronger approximations than the derivation of t_* = t_o − 1; see the argumentation after equation (14) of reference [25].

Escape time in strongly chaotic open systems
As introduced before, in strongly chaotic open systems E(x) = t_e(x) and P(t_e) ∼ exp(−κt_e). The aim here is to compute the t_*(x) that fulfills equation (34), taking into account that x is given and x' = x + δ̂δ with δ_x(x) given by equation (43). In this case the trajectory's length is not constant but is instead given by the escape time t_e(x).
A trajectory starting at x' proposed according to equation (43) fulfills |x'_t − x_t| ≤ Δ for t ≤ t_*. Therefore, up to t_*, the two trajectories are indistinguishable. Under this assumption, and from the definition of t_* in equation (37), t_e(x') = t_*(x) + t_e(x'_{t_*}), and therefore

E[t_e(x')|x] = t_*(x) + E[t_e(x'_{t_*})|x],   (57)

where t_e(x'_{t_*}) is the escape time of the state x' iterated t_* times and E[t_e(x'_{t_*})|x] = ∫_Ω g(x'|x) t_e(x'_{t_*}) dx' is the expected t_e(x'_{t_*}) over a neighborhood of x of size δ_x(x) (given by Eq. (43)). The idea behind this equation is represented in Figure 4. The proposal density g(x'|x) is such that, at t_*, the two trajectories are separated on average by Δ. After t_*, the trajectory starting at x'_{t_*} is approximately independent of x, and thus t_e(x') = t_* + t_e(x'_{t_*}). Moreover, proposing according to equation (43) is equivalent to "zooming" into the landscape of t_e(x) around x with a scale corresponding to the t_*'th iteration of the construction of the landscape. Given the self-similarity of the landscape (see e.g. Fig. A.2), under this zoom the landscape of t_e(x'_{t_*}) is equal to the landscape of t_e(x), and therefore E[t_e(x'_{t_*})|x] should be a constant independent of t_e. Moreover, because x'_{t_*} is approximately independent of x, E[t_e(x'_{t_*})|x] is just the average escape time of an independent state, which is the mean of P(t_e) and is given approximately by 1/κ:

E[t_e(x'_{t_*})|x] = 1/κ.   (58)

It is this result that incorporates the self-similarity of the escape-time function: the value of t_* chooses the particular zoom of the landscape (Fig. 4), and this equation assumes that, as long as the zoom is proportional to exp(λt_*), the average escape time of the zoomed region is still 1/κ. Replacing equation (58) in equation (57) gives

E[t_e(x')|x] = t_*(x) + 1/κ,   (59)

and subtracting t_e(x) on both sides of equation (59) gives

E[t_e(x') − t_e(x)|x] = t_*(x) + 1/κ − t_e(x).   (60)

The left side of this equation is the left side of the condition of constant acceptance rate, equation (34). Equating the two right sides and solving for t_*(x) gives

t_*(x) = t_e(x) − 1/κ + [ |d log π(E)/dE|_{E=t_e(x)} ]^{−1},   (61)

which is analogous to equation (54) and is the central result of this section. Together with equation (43), it defines the proposal distribution we set out to construct.
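The escape-time case can be checked numerically in the spirit of the tests reported later in the text: propose with the Eq. (43)-type scale at t_* = t_e(x) and verify that the escape time changes only by O(1/κ) on average, not by O(t_e). The single-slope open tent map with a = 3 and all parameter values are illustrative assumptions:

```python
import math
import random

A = 3.0               # open tent map slope (illustrative); escape rate kappa = log(A/2)
LAM = math.log(A)     # Lyapunov exponent on the surviving set (|F'| = A everywhere)
DELTA = 0.05

def tent(x):
    return A * x if x < 0.5 else A * (1.0 - x)

def escape_time(x, t_max=200):
    for t in range(1, t_max + 1):
        x = tent(x)
        if not 0.0 <= x <= 1.0:
            return t
    return t_max

def propose(x, t_star, rng=random):
    """Neighborhood proposal with the Eq. (43)-type scale DELTA*exp(-LAM*t_star);
    with t_star = t_e(x), the proposed state keeps a comparable escape time."""
    sigma = DELTA * math.exp(-LAM * t_star) * math.sqrt(math.pi / 2.0)
    return x + (1.0 if rng.random() < 0.5 else -1.0) * abs(rng.gauss(0.0, sigma))

random.seed(5)
x = random.random()
while escape_time(x) != 12:  # brute-force a state with a moderately rare escape time
    x = random.random()
te = escape_time(x)
diffs = [escape_time(propose(x, te)) - te for _ in range(1000)]
print(te, sum(diffs) / len(diffs))  # average change is O(1/kappa), not O(t_e)
```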

Summary: how to propose
The argument in this section can be summarized as follows:
- Metropolis-Hastings requires a proposal that guarantees a specific variation of E, which we approximate by equation (34);
- the proposal with a scale given by equation (43) guarantees that, on average, the trajectory starting at x' stays close to the trajectory starting at x up to t_*(x);
- the variation of E is related to t_*(x) via equations (49) and (60).
These three steps led to analytical formulas for t_*(x), equations (61) and (54), that make the proposal satisfy equation (34). Together with equation (43), they define the proposal distributions required for a constant average acceptance.
Three points deserve to be noted. The first is that the formulas we obtained for t_*(x) are also valid for the problem of finding rare states discussed, e.g., in references [11,12,16,18,65]. Specifically, these formulas were derived to guarantee equation (34), which dictates how different E[E(x')|x] has to be from E(x) to guarantee a constant acceptance. This condition is stronger than the condition required for an algorithm to find minima or maxima of E, which requires only proposing states x' such that E[E(x')|x] > E(x) (or vice-versa for minimizing E). This holds independently of the particular minimization algorithm (e.g. simulated annealing, steepest descent, stagger-and-step in open systems) because it only concerns which new state x' should be tried given the current state x. The second point is that equations (61) and (54) reduce the proposal to a uniform distribution when the sampling distribution is uniform: d log π(E)/dE → 0 implies δ_x(x) → ∞. The third point is that different proposals in the literature (precision shooting [23,25], the exponential proposal distribution [11,18], and the one in reference [24]) can be obtained from the proposals derived in this section through different approximations, as shown in Appendix C.
Overall, this section described a framework to incorporate information about chaotic systems (in this case the self-similar properties of the landscape, the exponential divergence of trajectories, and the exponential decay of correlations) in the construction of an efficient Metropolis-Hastings algorithm to sample them. A simplified description of the algorithm is given in Appendix D and an open-source implementation in reference [66]. The next section is devoted to testing the assumptions used here on each of the problems, escape time and FTLE, and to confirming the practical usefulness of the framework.
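Putting the pieces together, the overall procedure can be sketched end to end (this is a simplified illustration, not the reference implementation [66]): Metropolis-Hastings on an illustrative skewed tent map with a canonical ensemble π ∝ exp(−βt_oλ_{t_o}), with β < 0 so that large FTLEs are favored, using the neighborhood proposal with the simple fixed choice t_* = t_o − 1 mentioned in the text for the simulations of Figure 9; all parameter values are illustrative:

```python
import math
import random

W = 0.3        # skewed tent map parameter (illustrative)
T_O = 16       # trajectory length t_o
DELTA = 0.05   # similarity threshold
BETA = -2.0    # pi ~ exp(-BETA * t_o * lambda); BETA < 0 favours large FTLEs

LAMBDA_L = -(W * math.log(W) + (1 - W) * math.log(1 - W))  # Lyapunov exponent

def skewed_tent(x):
    return x / W if x < W else (1.0 - x) / (1.0 - W)

def ftle(x):
    """Finite-time Lyapunov exponent over t_o steps (1D: average of log|F'|)."""
    s = 0.0
    for _ in range(T_O):
        s += -math.log(W) if x < W else -math.log(1.0 - W)
        x = skewed_tent(x)
    return s / T_O

def metropolis(n_steps, rng):
    """MH with a symmetric Gaussian neighborhood proposal at fixed t_* = t_o - 1;
    a symmetric proposal makes the acceptance reduce to the ratio of pi."""
    sigma = DELTA * math.exp(-LAMBDA_L * (T_O - 1))
    x = rng.random()
    lam = ftle(x)
    out = []
    for _ in range(n_steps):
        x2 = x + rng.gauss(0.0, sigma)
        if 0.0 <= x2 <= 1.0:  # reject moves leaving the phase space
            lam2 = ftle(x2)
            if rng.random() < math.exp(-BETA * T_O * (lam2 - lam)):
                x, lam = x2, lam2
        out.append(lam)
    return out

samples = metropolis(20_000, random.Random(7))
m_tail = sum(samples[10_000:]) / 10_000
print(m_tail)  # should sit above LAMBDA_L ~ 0.611, since BETA < 0 rewards large FTLEs
```

As discussed above, a purely local proposal like this one mixes slowly between distant plateaus; the full method mixes it with the shift proposal, which this sketch omits because the tent map is not invertible.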

Numerical tests
The previous section concluded with a set of formulas (the proposal in equation (43) combined with the formulas for t_*(x), equation (54) or equation (61)) for proposing states x' that are expected to guarantee the desired acceptance rate (bounded away from 0 and 1). In this section we test some of the approximations made in the derivation of t_*(x) (in a simple system) and we analyze the efficiency of the algorithm (confirming the polynomial scaling) in the computation of the FTLE in closed systems, introduced in Section 2.1, and of the escape time in open systems, introduced in Section 2.2. The tests are performed in the skewed tent map and the open skewed tent map, and the efficiency is tested in numerous maps (chain of coupled Hénon maps, standard map, logistic map). All these systems are introduced in detail in Appendix A.

Fig. 5. The finite-time Lyapunov exponent of the first half of the trajectory is independent of the one of the second half. The graph was obtained by sampling 10^5 random initial conditions x_i and computing (λ_t(x_i), λ_t(F^t(x_i))) = (first half, second half) of a 2t = 16 step trajectory. The y axis represents the mean (full black) ± 2 standard deviations (full blue) of λ_t(F^t(x_i)) conditioned on a given λ_t. The dashed lines represent the same mean and standard deviation over all points (without conditioning). When λ_t(F^t(x_i)) is independent of λ_t(x_i), the dashed and full lines coincide within fluctuations, as observed. A Kolmogorov-Smirnov test comparing the unconditioned and conditioned distributions gives a p-value higher than 0.001 (the hypothesis that they are independent is not rejected).

Finite-time Lyapunov exponent
The first approximation made in the derivation of equation (54) is that, when δ_x ≡ |x' − x| is drawn from a half-normal distribution with scale parameter δ_x(x) given by equation (43), F^{t_*}(x') is sufficiently close to F^{t_*}(x) that E[λ_{t_*}(x')|x] = λ_{t_*}(x), equation (46), holds. We test this numerically by randomly drawing 10^5 states x_i in the tent map with a = 3 (see Appendix A) and, for each, proposing a state x'_i = x_i + δ(x) according to equation (41) with δ_x(x) given by equation (43). From the 10^5 pairs of states (x_i, x'_i), we estimate the difference of the observables E' ≡ E[tλ_t(x'_i)|x] and E ≡ tλ_t(x_i). Our expectation is that, for fixed t_* = t, E' − E should be much smaller than E (the relevant scale in Eq. (47)). We numerically obtain that, within the 99% quantile, E' − E ≈ 1 independently of E and t. This value is much smaller than E ∈ [0.6t, 1.1t], especially since we are interested in large t. This strongly supports the approximation we make in equation (46).
The second approximation made in the derivation of equation (54) is that there is no dependence between λ_t(x) and λ_{t_o−t}(F^t(x)), equation (48); in other words, that the FTLE of the trajectory starting at F^t(x) and ending at t_o is indistinguishable from one drawn from the distribution of FTLEs at finite time t_o − t, P(λ_{t_o−t}). This approximation was numerically tested by drawing points x_i, computing the pairs (λ_t(x_i), λ_t(F^t(x_i))) (i.e. t_o = 2t), and testing whether the conditional probability of λ_t(F^t(x_i)) equals the (unconditional) probability of λ_t(x_i). The results in Figure 5 confirm the equality of these probabilities. We finally test whether a proposal using t_*(x) given by equation (54) guarantees a constant acceptance rate, the original motivation for our calculation. The test consisted in sampling 10^6 states according to the following procedure: (1) uniformly draw a state x_i and compute λ_i ≡ λ_{t_o}(x_i); (2) generate a state x'_i according to the proposal distribution, equations (41) and (42), with δ_x(x) given by equation (43) and t_* given by equation (54), storing δ_i = |x'_i − x_i|. The ratio of the target distributions is given by r_i ≡ π(E')/π(E) = exp(−βt_o(λ'_i − λ_i)) for the canonical ensemble and by r_i = G(E)/G(E'), with G(E) = G(t_o λ_i) given by equation (A.3), for the flat histogram. The numerically estimated acceptance rate, A(λ_{t_o}) ≡ min(1, r_i × g_i) with g_i the proposal ratio, is shown in Figure 6. It is not independent of λ_{t_o} (the expected outcome based on the assumption of constant acceptance used in our derivations), but it is bounded away from 0 and 1 for increasing N = t_o (the original requirement for an efficient simulation set in Section 4.1). In the canonical ensemble there is a linear dependency of the acceptance on λ_{t_o}, and in the flat-histogram ensemble the acceptance is ≈0.8 at the maximum of P(λ_{t_o}) and decays to about 0.1 in the tails.

Efficiency of the flat-histogram ensemble
The success in achieving a bounded acceptance independent of N, as confirmed in the previous section for the tent map, does not guarantee the efficiency of the method (see Sect. 3.4). Here we test the efficiency of flat-histogram computations of P(λ_t) in the tent map that use the neighborhood proposal, the shift proposal, and both (mixed proposal). The results shown in Figure 7 suggest that only when both proposals are used does the efficiency scale polynomially with N. This result can be understood by looking at the landscape of λ_t, as illustrated in Figure 8. Imagine a flat-histogram simulation on this system for t_o = 4 (black curve in the figure), and analyze what happens to it in terms of a round-trip. Let us suppose that the simulation was recently at the minimum λ_t (0.41) and that the next round-trip is made by going to the maximum λ_t (1.09) and returning. Let us further suppose that the simulation eventually reached a state with λ_t ≈ 0.92. Because π(x) = π(λ_t(x)), every state at that λ_t is equiprobable. Therefore, the state can be at any of the plateaus (of the 4, see Fig. 8), proportionally to their size. However, not every plateau contains in its neighborhood a plateau with higher λ_t; for example, the plateau around 0.3 does not. From a state on such a plateau, a local proposal would never be able to reach a higher plateau directly. First, the simulation would need to go backward, reach the maximum of P(λ_t) (around 0.69), where the proposal can propose any other state, and then try to find another path towards a higher λ_t. This would already be a problem if the simulation were a canonical ensemble with a β favoring higher λ_t's, since it would require decreasing λ_t by an amount Δλ, which happens with a probability that decays exponentially with Δλ. This is solved by using a flat-histogram ensemble.
However, the crucial challenge here is that, as t increases, the number of local maxima also increases while the number of maxima connected with the global maximum remains constant: the red curve, with t = 6, now contains 11 plateaus for λ_t ≈ 0.87, but only 1 is locally connected to the maximum λ_t, the one around 0. This means that, as t increases, it becomes more difficult to perform a round-trip: not only because the expected diffusion time increases (see Eq. (28)), but also because the simulation more often diffuses forward and backward until it reaches the global maximum. The shift proposal alleviates this problem by allowing non-local proposals in the phase space: a shift proposes a state x' outside the neighborhood of x, which improves the probability of reaching higher or lower λ's. This explains why the combined proposal has such a low round-trip time in this system. Finally, we confirm that a polynomial efficiency is obtained in the computation of P(λ_{t_o}) more generally. We performed flat-histogram simulations, using the Wang-Landau algorithm to estimate P(λ_{t_o}), on different chaotic systems: the tent map, the logistic map, and the standard map (see Appendix A for details). The results are shown in Figure 9 and confirm the dramatic improvement and generality of using Metropolis-Hastings to sample rare states. The simulations in Figure 9 use t_*(x) = t_o − 1 instead of the value given by equation (54). This is computationally always more expensive, because the correlations due to the neighborhood proposal are maximal (see the discussion after Eq. (54)), but on the other hand this proposal is simpler because it does not require estimating d log P/dE and λ_L.

Fig. 9. Standard map [4] with K = 8 and U(x) = const. used in every case. The Wang-Landau algorithm was used to estimate the distribution prior to performing the flat-histogram simulation, and the distribution agrees with the analytical one when available [39,40]. The proposal distribution used was a mixed proposal composed of a 50% chance of the neighborhood proposal with t_* = t_o − 1 and Δ = 1, and a 50% chance of the shift proposal with t_shift = 1. Adapted from reference [25].

Transient chaos
The derivation of equation (61) uses the assumption that, when x' is proposed with a scale given by equation (43) with t_* = t_e, E[t_e(x')|x] ≈ t_e(x) up to terms of order 1/κ, as per equation (57). We tested this assumption in a way similar to the test of the approximation of equation (46) for the FTLE: we uniformly draw states x_i, compute their escape times t_e ≡ t_e(x_i), and, for each, generate a state x'_i = x_i + δ(x) with a scale δ_x(x) given by equation (43) with t_*(x) = t_e(x) and compute its escape time t'_e ≡ t_e(x'_i). The assumption is valid when, on average, t'_e − t_e ≪ t_e for large t_e. We performed this experiment in the following systems: the open tent map with a = 3 and b = 5, the standard map with K = 6, and coupled Hénon maps with D = 2, 4, 6, 8 (see Appendix A), all with Δ = 1. We observed that in all 6 cases the average t'_e − t_e is smaller than 1 for all t_e > 2/κ. These observations show that the assumption of equation (57) is valid for a broad class of chaotic systems.
A second assumption tested here is the self-similarity argument used in deriving equation (58): irrespective of the particular scale we choose (by decreasing t′), the properties of the zoomed phase space remain the same. In particular, we are interested in checking that t_e − t′_e does not depend on t_e when we choose a larger scale, i.e. when t′ is decreased by a constant. We thus repeat the experiment above but decrease t′ to t′(x) = t_e(x) − T with T = 0, 1/κ, 5/κ, to check that increasing T decreases the average t_e − t′_e without changing its independence from t_e.
[Figure 10 caption: ... with three different exponents. When the exponent is larger than λ_L, δ_x(x) effectively aims for a smaller t′ and therefore a larger distance t_e(x) − t_e(x′), consequently decreasing the acceptance. When the exponent is smaller than λ_L, δ_x(x) aims for a larger t′ than the one given by equation (61); therefore t_e(x) − t_e(x′) decreases to 0 as t_e(x) → ∞ and the acceptance converges to 1. Adapted from reference [24].]
Our numerical tests in the same systems as before show that t_e − t′_e decreases with increasing T and remains independent of t_e, confirming our hypothesis.
The previous tests indicate that the proposal distribution should induce a constant acceptance rate when the Lyapunov exponent of the system, equation (C.2), is used. To confirm that this is the case, a flat-histogram simulation was performed with an isotropic proposal distribution of width δ_x(x) = δ_x(t_e(x)) given by equation (C.2) with λ_L(t_e) = λ_L. The results in Figure 10, reproduced from reference [24], confirm that proposing with λ_L guarantees a constant acceptance, and that any other exponent in equation (C.2) fails to do so.
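For concreteness, one Metropolis-Hastings step of such a flat-histogram simulation can be sketched as below. This is our own minimal sketch (open tent map, a fixed exponent λ_L = log a, and a dictionary standing in for the estimated P(t_e)); the actual simulations use the Wang-Landau estimates and the proposal of equation (C.2).

```python
import math
import random

A, B = 3.0, 5.0  # open tent map parameters

def tent_open(x):
    return A * x if x < 1.0 / A else B * (1.0 - x)

def escape_time(x, t_max=100):
    for t in range(t_max):
        if not 0.0 <= x <= 1.0:
            return t
        x = tent_open(x)
    return t_max

def mh_flat_step(x, log_P, rng, delta=1.0, lam=math.log(A)):
    """One Metropolis-Hastings step targeting pi(x) ~ 1/P(t_e(x)):
    accept x' with probability min(1, P(t_e(x)) / P(t_e(x'))).  The
    Gaussian scale depends on x, so the symmetry of the proposal is only
    approximate here; this is a sketch, not the full method."""
    te = escape_time(x)
    xp = x + rng.gauss(0.0, delta * math.exp(-lam * te))
    if not 0.0 <= xp <= 1.0:
        return x  # proposals outside the phase space are rejected
    log_acc = log_P.get(te, 0.0) - log_P.get(escape_time(xp), 0.0)
    if log_acc >= 0.0 or rng.random() < math.exp(log_acc):
        return xp
    return x
```

Iterating mh_flat_step while updating log_P with the Wang-Landau rule yields a flat-histogram simulation in t_e.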
The above tests confirm that the derivation made in Section 4.3.2 holds for a paradigmatic strongly chaotic open system. These tests also illustrate a major advantage of the approach in this paper: it allows testing the assumptions made at each step, something that other approaches, such as the ones in references [11,18,19], do not explicitly allow.
The efficiency τ(t_max) of the simulation is tested in Figure 11 as a function of the maximal escape time considered, t_max, in the generic coupled Hénon maps defined by equation (A.8). It confirms the dramatic improvement of Metropolis-Hastings with the proposal derived in Section 2.2 over uniform sampling: the scaling is polynomial using importance sampling, and exponential in uniform sampling.
The derivation in Section 4.3.2, the tests presented above, and the results in Figure 11 show why and how importance-sampling Metropolis-Hastings can efficiently sample long-living trajectories in strongly chaotic open systems. The proposal distribution should be applicable to strongly chaotic open systems more generally, as the approximations made are expected to hold in other strongly chaotic systems.

Summary of results
We have introduced a framework to sample rare trajectories in different classes of chaotic systems. It is based on the Metropolis-Hastings algorithm, a flexible and well-established Monte Carlo method suitable for the investigation of many numerical problems in chaotic dynamical systems (as shown in Sect. 2). Our main contribution is a procedure (see Sect. 4) to construct the proposal step of the Metropolis-Hastings algorithm (which proposes a new state x′ given the current state x) that ensures the efficiency of the sampling. The main arguments in the construction of this procedure are summarized in Table 1.
A successful application of our framework leads to an algorithm in which the number of samples to obtain an independent rare sample scales polynomially with the difficulty of the problem, as opposed to the exponential increase observed in traditional uniform sampling (see Figs. 9 and 11).

Comparison to previous results
The importance of the proposal distribution has long been emphasized for Monte Carlo methods [7,61] and for sampling chaotic systems [11,13,14,16]. The questions that remain open from these works, and that we tackle in our paper, are how the efficiency of the sampling method is related to the proposal distribution and how an efficient proposal can be constructed from assumptions about the system. The proposals derived in this paper can be mapped, under appropriate simplifications, to known results from the literature, specifically to the proposals derived in references [24,25] and to the stagger part of the algorithm presented in reference [11]. We make this connection explicit in Appendix C. Another example of an algorithm that can be directly analyzed by the framework developed here is the precision shooting proposed in references [23,67] and used in numerous applications of transition path sampling to chemical reactions [67-70]. The precision-shooting method proposes a state x′ isotropically distanced from x by δ_x(x) given by equation (43), with t′(x) = t_o, where t_o is the length of the trajectory. From our results we conclude that this proposal is sub-optimal because it over-correlates the proposed state. Another application of the results of Section 4 is to the algorithm Lyapunov Weighted Dynamics of references [16,19], which uses a population Monte Carlo algorithm. As mentioned in these references (see also Ref. [20]), there is a parameter ε that controls how far new clones x′ should be placed from the existing clone x, and it should be neither too small nor too large. This parameter plays a role similar to δ_x(x) in Section 4, and an optimal ε should therefore be related to our results. In comparison to previous sampling methods in dynamical systems, including those mentioned above and others (e.g., Ref. [11]), the distinguishing feature of our results is that they provide an explicit connection between the proposal distribution and the acceptance rate. This connection, typically absent in Monte Carlo methods more generally, is extremely powerful because failures of the algorithm can be related to violations of the hypotheses (about the method and the dynamical system) used in our derivations. Such violations should then be understood, and this understanding can be inserted back into the methodology to generate new methods adapted to that situation. We hope this process will increase the range of applicability of our framework to other classes of dynamical systems (e.g., non-hyperbolic systems [33,44], see Ref. [71]) and observables E [72].
Possible improvements of our results can be obtained by considering anisotropic search domains, an idea that has been shown to be essential for finding chaotic saddles in systems with more than one positive Lyapunov exponent [45].

Discussion
The proposal distribution is a way of moving stochastically in the phase space, and how to select a new state x′ from a given state x is a general problem in different numerical techniques. Some of the most successful numerical algorithms in the literature, such as the golden-section search or gradient descent, are essentially generic and efficient ways of selecting a new state. The results in Figures 7 and 9 show that the proposal distribution strongly influences the computational cost of the different procedures, often irrespective of the particular sampling procedure (canonical or flat-histogram) and problem (sampling or finding). This reinforces the notion that the proposal of the new tentative state from the current state is a crucial factor when developing numerical techniques for optimization and numerical integration. We can therefore expect the main insights of this paper, e.g. the direct connection between fundamental properties of chaotic systems and the optimal proposal distribution, to be useful also for other problems (e.g., the optimization problem in which one is interested in maximizing or minimizing the observable).
The development of a numerical algorithm requires a compromise between how fast it solves a particular problem and how well it is able to solve different problems. One interesting aspect of the algorithms (i.e. proposal distributions) introduced in this paper is that even though they can be made very specific (e.g. propose with the FTLE of the trajectory, Eq. (43)), they can also be made more general (e.g. propose using the Lyapunov exponent of the system, or the power-law proposal distribution, which does not use any specific information about the system). That is, more specificity requires more information (the FTLE of the trajectory) and makes the algorithm more efficient, while less information (only the Lyapunov exponent of the system) makes the algorithm less specific and typically also less efficient, although proposals using less information, such as the power-law proposal, can be more robust in complicated cases [45]. This capability shows that the methodology is useful not only to study a particular system about which some information is known, but also in situations where less is known about the system. This does not mean that the methods apply to all problems, as there are important classes of chaotic systems in which some of the assumptions used in Section 4 are violated. For example, in non-hyperbolic systems [33,44] trajectories may remain correlated for a long time, which implies that one cannot assume that after t′ the trajectories are independent. Nevertheless, because the framework was outlined in the form of adding known information about the system, improved insights about a class of chaotic systems can be translated into a faster algorithm.
One advantage of the methodology presented here is that it is not restricted to specific observables E(x). In principle, it can be used to construct proposal distributions to sample rare states of different observables E, e.g. E(x) = tλ_t and E(x) = t_e(x). While it remains unclear what is the precise class of observables for which our methodology allows constructing an efficient proposal distribution, the different derivations of the proposal distribution do provide insights into the observables for which a proposal distribution could be constructed from properties of the system. As argued at the end of Section 4.3.1, observables computed as an average along the trajectory are similar to the FTLE, and therefore the proposal distribution derived in equation (56) should lead to efficient algorithms in these cases.
Altogether, our results reveal a fascinating interplay between the chaotic nature of some non-linear systems and the numerical techniques available to study rare events. They reinforce the idea that an efficient proposal requires information about the system (e.g., our derivation of the proposal distribution used the facts that "trajectories diverge exponentially" and that "the escape time function is a fractal-like function"). The analysis of this interplay allows us both to better understand these systems and to better understand these numerical techniques. This understanding opens perspectives to develop better techniques to numerically study rare trajectories and extreme events in non-linear systems more generally [72].

Author contribution statement
All authors designed the research, developed the methodology, and wrote the manuscript. J.C.L. implemented the algorithms and performed the simulations.
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Appendix A: Dynamical systems
In this appendix we describe the different dynamical systems that we use throughout the paper.

A.1 Skewed tent map
A paradigmatic example of a strongly chaotic system is the tent map [4], defined on Ω = [0, 1] by F(x) = ax for 0 ≤ x ≤ 1/a and F(x) = b(1 − x) for 1/a < x ≤ 1, where a > 1 is a constant and b ≡ a/(a − 1). This map contains the main features of a chaotic system: it has a positive Lyapunov exponent λ_L = (a log b + b log a)/(a + b) and a positive measure. The finite-time Lyapunov exponent is given by λ_t(x) = [i(x) log a + (t − i(x)) log b]/t, where i(x) is the number of times x_t ∈ [0, 1/a]. The distribution of the finite-time Lyapunov exponent for U(x) = 1, P(E) = G(E), can be computed analytically and is a binomial, G(λ_t) = C(t, i) a^(−i) b^(−(t−i)), where i = i(x) determines λ_t as above.
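These formulas can be checked numerically with a short script. The following sketch (our own, with the illustrative choice a = 3) computes the FTLE of a trajectory and the binomial weights; note that 1/a + 1/b = 1, so the weights sum to one.

```python
import math

A = 3.0
B = A / (A - 1.0)  # b = a/(a-1): the map is closed (no escape)

def tent(x):
    return A * x if x < 1.0 / A else B * (1.0 - x)

def ftle(x, t):
    """Finite-time Lyapunov exponent, lambda_t = [i log a + (t-i) log b]/t,
    accumulated as the average log-slope along the trajectory."""
    s = 0.0
    for _ in range(t):
        s += math.log(A) if x < 1.0 / A else math.log(B)
        x = tent(x)
    return s / t

def binomial_weight(i, t):
    """Analytical weight G of the FTLE value with i visits to [0, 1/a]:
    G = C(t, i) a^(-i) b^(-(t-i))."""
    return math.comb(t, i) * A ** (-i) * B ** (-(t - i))

# Lyapunov exponent of the map: (a log b + b log a) / (a + b)
lyapunov = (A * math.log(B) + B * math.log(A)) / (A + B)
```

Since the invariant measure is uniform and the branch probabilities are 1/a and 1/b, the Lyapunov exponent equals (1/a) log a + (1/b) log b, which coincides with the fraction above because ab = a + b when b = a/(a − 1).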

A.2 Standard map
The standard map considered here is defined on x ≡ (p, θ) ∈ Ω = [0, 1] × [0, 1], evolving in time according to p_{t+1} = p_t + (K/2π) sin(2π θ_t) mod 1 and θ_{t+1} = θ_t + p_{t+1} mod 1. We focus on the parameter K = 6, which leads to a phase space with no visible KAM islands. We also consider the leaked (open) version of this map, obtained by introducing a hole into the system at Λ = [0.1, 0.1].
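One iteration of the map can be sketched as follows. The convention below (rescaled to the unit square) is a common one and an assumption on our part, as is the parameter name.

```python
import math

K = 6.0  # parameter regime with no visible KAM islands

def standard_map(p, theta):
    """One iteration of the standard map on [0,1]x[0,1], in the common
    convention p' = p + (K/(2 pi)) sin(2 pi theta) mod 1,
    theta' = theta + p' mod 1."""
    p = (p + K / (2.0 * math.pi) * math.sin(2.0 * math.pi * theta)) % 1.0
    theta = (theta + p) % 1.0
    return p, theta
```

For the leaked (open) version, one simply stops the iteration when the trajectory enters the hole Λ, recording the escape time.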

A.3 Skewed open tent map
The paradigmatic example of a strongly chaotic open system is the open tent map [10], defined on Ω = [0, 1] by F(x) = ax for 0 ≤ x ≤ 1/a and F(x) = b(1 − x) for 1/a < x ≤ 1, where a > 1 and b > a/(a − 1). The state x escapes at the first time t_e(x) at which its trajectory leaves Ω.
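A short numerical check of this map (a sketch with a = 3, b = 5, the parameter choice used in the tests of the transient-chaos section): the measure of states surviving t steps decays as (1/a + 1/b)^t, which defines the escape rate κ = −log(1/a + 1/b).

```python
import math
import random

A, B = 3.0, 5.0  # b > a/(a-1): a gap opens and trajectories escape

def tent_open(x):
    return A * x if x < 1.0 / A else B * (1.0 - x)

def escape_time(x, t_max=60):
    """First time t at which the trajectory leaves Omega = [0, 1]."""
    for t in range(t_max):
        if not 0.0 <= x <= 1.0:
            return t
        x = tent_open(x)
    return t_max

# Fraction of uniformly drawn states with t_e > 5, to be compared with the
# exact measure (1/a + 1/b)^5 of the 5-step surviving set.
rng = random.Random(42)
survive_5 = sum(escape_time(rng.random()) > 5 for _ in range(100_000)) / 100_000
kappa = -math.log(1.0 / A + 1.0 / B)
```

The surviving set after t steps consists of 2^t intervals, each mapped linearly onto [0, 1], which is why the survival probability factorizes into (1/a + 1/b)^t.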

A.4 Coupled open Hénon map
As a generic example of a high-dimensional strongly chaotic open system, we consider a set of d/2 coupled Hénon maps on a ring, defined by a state x = (x_1, y_1, ..., x_{d/2}, y_{d/2}) ∈ Ω = R^d, where each individual map (x_i, y_i) evolves according to equation (A.8).
[Figure caption: There are 2^t intervals, and the size of each plateau can be analytically computed from the one at the previous time, and therefore written analytically. Specifically, a given interval has size ε(x) = a^(−i(x)) b^(−(t_e(x)−i(x))), where i(x) is the number of times 0 < F^t(x) < 1/a for t = 1, ..., t_e(x). The crucial observation is that λ_{t_e(x)}(x) is proportional to log(ε(x))/t_e(x). Adapted from reference [24].]

C.2 Adaptively estimate the Lyapunov exponent
Using the proposal distribution with λ_L requires a priori knowledge of it, which is typically not available. This difficulty resembles the problem that flat-histogram simulations face: G(E) is required, but it is typically unknown a priori. This analogy motivates a Monte Carlo procedure that computes on the fly a δ_x(t) that scales with λ_L. Consider a hypothetical simulation with an isotropic proposal distribution (Eq. (41)) with δ_x(x) = σ(t_e(x)), (C.6) where σ(t) is initially set to σ(t) = 1 for every t. Consider also that the simulation reached a state x with a high escape time (e.g. t_e = t_e(x) = 10/κ). A proposed state, x′ = x + ĥ σ(t_e), will most likely have a much lower escape time (e.g. t_e(x′) = 1/κ). From equations (43) and (61), this indicates that σ(t_e) is much higher than the "correct" proposal scale, δ_x(x), and therefore it should be reduced in the next proposal. The opposite is also true: when σ(t_e) is much smaller than δ_x(x), t_e(x′) = t_e(x) and it should be increased. This hypothetical simulation suggests that, in the same spirit as the Wang-Landau algorithm to approximate the density P(t_e), it is possible to approximate δ_x(x) using an update scheme that can be inserted in the Metropolis-Hastings algorithm and that is given by the same algorithm as the Wang-Landau one (see Sect. 3.3.2), but besides updating P_WL(t), it also updates σ(t) [24]: σ(t_e) → σ(t_e) f for t_e(x′) = t_e, and σ(t_e) → σ(t_e)/f for t_e(x′) < t_e. (C.7) This update scheme generalizes the Wang-Landau procedure to the proposal distribution. It is expected to converge to a function σ(t) that decays exponentially with the Lyapunov exponent of the system, and to a proposal distribution with a constant acceptance rate in a flat-histogram simulation. It was extensively tested in different systems (tent map, fully chaotic standard map with a leak, coupled Hénon maps with different D), see references [24,45,71].
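The update rule (C.7) can be sketched as follows (a minimal sketch; the function name and the choice f = e are illustrative assumptions of ours):

```python
import math

def update_sigma(sigma, te, te_prime, f=math.e):
    """Wang-Landau-style update of the proposal scale, eq. (C.7):
    if the proposed state kept the same escape time, the move was too
    correlated, so sigma(te) is increased by a factor f; if it escaped
    earlier, the move was too far, so sigma(te) is divided by f."""
    if te_prime == te:
        sigma[te] = sigma.get(te, 1.0) * f
    elif te_prime < te:
        sigma[te] = sigma.get(te, 1.0) / f
    # te_prime > te: left unchanged (eq. (C.7) specifies only the other cases)
    return sigma
```

Inserted in the Metropolis-Hastings loop (with f reduced over time, as in Wang-Landau), σ(t) is expected to converge to a scale that decays exponentially with λ_L.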

C.3 Power-law proposal distribution
The proposal distributions derived in the previous sections require some knowledge about the state and the system: λ_{t_o}(x), t_e(x) (in open systems), the λ_L of the system and, in some situations, G(E(x)). An alternative that avoids computing δ_x(x) is to consider a proposal in which the time t′ that the two trajectories remain together is not imposed by a given t′(x), but is a uniformly distributed random variable in [0, t_o] (FTLE) or [0, t_e(x)] (open systems), generated anew on each proposal. Some values of t′ will be far from the optimal t′(x) and the corresponding x′ will be rejected or will be too close to x, but other values of t′ will still be close to the optimal t′(x) and therefore useful. Having a uniformly distributed correlation t′ still requires computing δ_x(x) in equation (43), which requires λ_{t_o}(x). In case λ_t(x) is unknown (e.g. in open systems one may be interested only in the escape time and therefore not compute λ_t(x)), one may further approximate it by a uniform distribution between two extremes. Approximating the product of these two uniformly distributed random variables as itself uniformly distributed, this leads to a proposal distribution where δ_x(x) is given by exp(−U(a, b)), with U(a, b) a uniform random variable on [a, b] and a, b free parameters. By a standard transformation of variables, this leads to a scale δ_x that is power-law distributed, P(δ_x) = [log(δ_max/δ_min)]^(−1) δ_x^(−1) for δ_min ≤ δ_x ≤ δ_max, (C.8) where δ_min and δ_max ≈ |Γ| are two free parameters. The δ in x′ = x + ĥδ is half-normally distributed with scale δ_x, but since this scale is now power-law distributed, it is no longer necessary to use the half-normal distribution at all; instead, one can use the power-law proposal distribution directly, where δ is drawn from P(δ_x) in equation (C.8), i.e. |x′ − x| is power-law distributed according to equation (C.8). Without the half-normal distribution the proposal distribution no longer depends on x and therefore g(x′|x)/g(x|x′) = 1.
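A sketch of this power-law proposal (function names and parameter values illustrative): draw u uniformly and set δ = exp(−u), so that |x′ − x| is distributed as 1/δ on [δ_min, δ_max] and the proposal-ratio term equals 1.

```python
import math
import random

def powerlaw_delta(rng, delta_min=1e-12, delta_max=1.0):
    """Draw delta with P(delta) ~ 1/delta on [delta_min, delta_max]
    (eq. (C.8)): delta = exp(-u) with u uniform in
    [-log(delta_max), -log(delta_min)]."""
    u = rng.uniform(-math.log(delta_max), -math.log(delta_min))
    return math.exp(-u)

def powerlaw_proposal(x, rng, delta_min=1e-12, delta_max=1.0):
    """Propose x' = x +/- delta; the proposal uses no property of x,
    so g(x'|x)/g(x|x') = 1."""
    delta = powerlaw_delta(rng, delta_min, delta_max)
    return x + (delta if rng.random() < 0.5 else -delta)
```

Equivalently, log δ is uniform over [log δ_min, log δ_max], so every scale between the two extremes is tried with equal probability.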
The stagger part of the algorithm of reference [11] proposes exactly with a scale given by equation (C.8), and the argument above explains why the proposal distribution used in reference [11] to find states with high escape time t_e is reported to work well: it is a proposal distribution that proposes x′ correlated with x with a correlation t′ that is uniformly distributed, and which therefore eventually proposes x′ with the optimal correlation t′(x). To confirm this explanation, let us consider a flat-histogram simulation with a power-law proposal distribution on the open tent map and consider the measurement of E[log δ_x | A*, t_e], where A* is the condition ε < a(x′|x) < 1 − ε (of bounded acceptance). Under the above argument, the scale log δ_x that contributes to a bounded acceptance is given by −λ_{t_e} t_e, per equation (C.5). This is confirmed by the numerical simulation shown in Figure C.1, and was obtained also for the problem of finding rare states, as reported in reference [45]. This result, combined with the derivation of t′(x), explains the success of the proposal (the stagger) used in reference [11] from basic notions of chaotic systems and numerical methods.