1 Background

Diffusions with random switching are stochastic processes consisting of two components: a diffusion process \(X_t\) in \(\mathcal {R}^d\) and a continuous-time jump process \(Y_t\) on a finite set F. The dynamics of \(X_t\) follows a stochastic differential equation (SDE)

$$\begin{aligned} \hbox {d}X_t=b(X_t,Y_t)\hbox {d}t+\sigma (X_t,Y_t)\hbox {d}W_t. \end{aligned}$$
(1.1)

Throughout, we assume \(b(x,y)\) is \(\mathcal {C}^{1+\delta }\) and \(\sigma (x,y)\) is \(\mathcal {C}^{2+\delta }\) in x for some \(\delta >0\). The behavior of \(Y_t\) can be described by a transition rate function \(\lambda (x,y,\tilde{y})\); in other words,

$$\begin{aligned} \mathbb {P}(Y_{t+h}=\tilde{y}|X_t,Y_t)={\left\{ \begin{array}{ll}\lambda (X_t,Y_t, \tilde{y})h+o(h), &{}\tilde{y}\ne Y_t,\\ 1-\bar{\lambda }(X_t,Y_t)h+o(h), &{}\tilde{y}=Y_t. \end{array}\right. } \end{aligned}$$

We denote the total transition rate as \(\bar{\lambda }(x,y)=\sum _{\tilde{y}\ne y} \lambda (x,y,\tilde{y})\), and the joint process as \(Z_t=(X_t, Y_t)\), which takes values in the space \(E=\mathcal {R}^d\times F\).
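For concreteness, the joint process \(Z_t\) can be simulated by combining an Euler–Maruyama step for (1.1) with first-order sampling of the jumps of \(Y_t\) from the rates \(\lambda \). The sketch below is only illustrative: the drift, noise level and rates (a switching OU process with a hypothetical |x|-dependent switching rate) are made-up stand-ins, not taken from any model in this paper.

```python
import numpy as np

def simulate_switching_diffusion(b, sigma, lam, x0, y0, F, T, dt, rng):
    """Euler-Maruyama for dX = b(X,Y)dt + sigma(X,Y)dW, with Y jumping
    to state y' with probability lam(x, y, y')*dt + o(dt) per step."""
    n = int(round(T / dt))
    x, y = float(x0), y0
    xs, ys = [x], [y]
    for _ in range(n):
        # sample a possible jump of Y (first-order accurate in dt)
        rates = np.array([lam(x, y, yp) if yp != y else 0.0 for yp in F])
        if rng.random() < rates.sum() * dt:
            y = F[rng.choice(len(F), p=rates / rates.sum())]
        # one diffusion step for X in the current regime y
        x += b(x, y) * dt + sigma(x, y) * np.sqrt(dt) * rng.standard_normal()
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# a switching OU process as in (1.2): regime 1 is inflating (gamma < 0)
gamma = {0: 1.0, 1: -0.5}
xs, ys = simulate_switching_diffusion(
    b=lambda x, y: -gamma[y] * x,
    sigma=lambda x, y: 0.5,
    lam=lambda x, y, yp: 2.0 + abs(x),   # hypothetical |x|-dependent rate
    x0=1.0, y0=0, F=[0, 1], T=5.0, dt=1e-3,
    rng=np.random.default_rng(0))
```

The first-order jump sampling above suffices for intuition; an exact alternative would simulate the jump clock by thinning.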

Diffusions with random switching are widely used for modeling purposes in many areas [38] and have recently become particularly prominent in the following directions:

  1.

    For stochastic lattice models in climate science [12, 14, 15, 25, 26, 31, 33, 48, 49], \(X_t\) represents the dry atmosphere, so (1.1) is the spatial discretization of a fluid equation. Meanwhile, \(Y_t\) represents the unresolved behavior of moisture and clouds.

  2.

    In material science [21,22,23,24] and molecular biology [6, 45], \(X_t\) represents some macroscopic quantities such as the transmembrane electronic potential, and \(Y_t\) stands for the behavior of some particular clusters, proteins, channels and cells.

  3.

    As a simulation strategy for complex processes [8, 33], a Markov jump process can be used as a stochastic parameterization of some subgrid scale processes. It reduces the model dimension and preserves most of the statistical quantities. It can be seen as the \(Y_t\) part in our joint process.

  4.

    In filtering and predictive modeling [16, 30, 32], diffusions with random switching are used as test beds to quantify the uncertainty from model errors.

The popularity of diffusions with random switching comes from their complexity. Even if (1.1) is an Ornstein–Uhlenbeck (OU) process for each fixed value of \(Y_t\), that is

$$\begin{aligned} \hbox {d}X_t=-\gamma (Y_t) X_t \hbox {d}t+\sigma (Y_t)\hbox {d}W_t, \end{aligned}$$
(1.2)

the switching of \(Y_t\) can generate very rich nonlinear properties, such as polynomial heavy tails [3, 11] and nonregular invariant measures [2, 29]. In the applications mentioned above, this flexibility is exploited to capture natural phenomena, while the equation in each regime is simple enough for intuitive understanding.

With so many applications of diffusions with random switching, the following questions naturally arise:

  • When does the joint process \(Z_t=(X_t,Y_t)\) possess an invariant measure? What kinds of statistics, for example moments of \(|X_t|\), are integrable under the invariant measures?

  • Is the invariant measure unique? How does it attract other statistical states?

In practice, these questions often serve as important sanity checks for stochastic models. This is because parameter inference and model validation often require matching statistics between nature and simulations, while the well-posedness of these operations requires existence of invariant measures, ergodicity and finite moments for the models [35, 37].

In the last decade, a series of works has been devoted to the questions above [1, 3,4,5, 7, 11, 39, 46]. In the simplest setting, the transition rates are constants, \(\lambda (x,y,y')=\lambda (y,y')\), and \(X_t\) is driven by the linear equation (1.2). Both questions above are relatively well understood in this setting, thanks to an application of the Perron–Frobenius theorem [3, 11]. These intuitive results can be extended to nonconstant transition rates through a probabilistic coupling argument [7, 46]. However, for this argument to work, the transition rates have to be globally bounded and Lipschitz. This restriction excludes many important applications [14, 36] or imposes additional nonphysical compactness requirements on the model space [6]. This paper intends to bridge this gap by developing a new analytical framework.

In order for the joint process to have an invariant measure and finite moments of \(|X_t|\), the rough requirement is that the averaged dynamics is dissipative. In the simplest setting of [3, 11], this condition can be formulated as \(\mathbb {E}\gamma (Y_t)>0\) with \(\gamma \) as in (1.2). It is difficult to generalize this condition to settings with nonconstant, unbounded transition rates. We will explore two directions for generalization:

  1.

    Inspired by the formulation in [48, 49], we assume there is a multiscale structure in the transition rates \(\lambda \) controlled by \(|X_t|\), while the fast averaging procedures induce dissipation. It is important to note that the multiple scales here are not introduced by an auxiliary variable \(\epsilon \) as in other standard settings [43, 44].

  2.

    There is a comparison principle in favor of dissipation.

Interestingly, in both directions, the averaged dissipation can be demonstrated by constructing a polynomial Lyapunov function, \(V(x,y)=\sum a_i(y)|x|^{m_i}\). In the multiscale setting, the construction arises as a dual process of the fast averaging procedures, and \(a_i(y)\) represents the potential dissipation of regime \(y\in F\) in a particular transition scale. To the best of the authors’ knowledge, this is the first explicit connection between averaging and Lyapunov functions. Since \(a_i(y)\) is interpreted as the potential dissipation, which can roughly be defined as a renormalization of \(\mathbb {E}^y\exp (-\int ^t_0\gamma (Y_s){\text {d}}s)\) for large t, the comparison principle has an intuitive formulation, and its verification is much more straightforward and general than the coupling approach used in [7, 46]. This idea of Lyapunov function construction can be traced back to [39], while our results generalize the working conditions and offer new probabilistic interpretations of them.

The second part of this paper discusses geometric ergodicity assuming the existence of a Lyapunov function. Following the frameworks of [17, 19, 41], it suffices to show a version of the small set argument under a proper distance. This can be achieved in two scenarios:

  1.

    If there is a commonly reachable regime that satisfies the minorization condition, Theorem 3.4 proves geometric ergodicity in the total variation distance.

  2.

    If there is contraction on average, and the transition rates and their first derivatives are bounded by the Lyapunov function, Theorem 3.6 shows geometric ergodicity in a proper Wasserstein distance.

The unbounded transition rates appear to be a major obstacle in the second scenario, as the coupling method of [7, 46] fails to work. It is resolved by viewing a diffusion with random switching as an annealed piecewise deterministic Markov process and then applying the asymptotic coupling framework of [17, 19] to the underlying densities. This strategy was exploited by the authors in [36] to study the piecewise contractive stochastic lattice models of [48, 49], and here we generalize it to the contractive-on-average setting.

The remainder of this paper is arranged as follows. Section 2 discusses criteria that lead to dissipation on average, and how to construct a Lyapunov function in different scenarios. Section 3 gives precise statements of geometric ergodicity when there is a hypoelliptic regime or contraction on average. Conditions leading to the second scenario are briefly discussed and compared with the results in Sect. 2. The proofs of geometric ergodicity are contained in Sect. 4, where we also discuss how to verify the accessibility of one regime. Section 5 summarizes the results and discusses some related questions.

2 Dissipation on average and Lyapunov functions

A simple way to generalize (1.2) to a nonlinear setting is to assume that a rate function \(\gamma :F\mapsto \mathcal {R}\) measures the dissipation or inflation of each regime in F.

Assumption 2.1

With some strictly positive constants K, \(\epsilon >0\),

$$\begin{aligned} \langle b(x,y) , x\rangle \le - \gamma (y)|x|^2 +K,\quad \Vert \sigma (x,y)\Vert ^2\le K |x|^{2-\epsilon }. \end{aligned}$$
(2.1)

Notice that \(\gamma (y)\) could be negative, which corresponds to inflation rather than dissipation and makes the global dissipation problem nontrivial. Given a transition dissipation pair \((\lambda , \gamma )\), the main objective of this section is to find intuitive criteria that lead to dissipation on average in different scenarios, and to show that there exists a polynomial-like Lyapunov function.

Lyapunov functions are good tools to illustrate dissipation. In this paper, we say a function \(V:E\mapsto [0,\infty )\) is a Lyapunov function if it has compact sublevel sets, and for some strictly positive constants \(\bar{\gamma }\) and K

$$\begin{aligned} \mathcal {L}V(z)\le -\bar{\gamma } V(z) +K,\quad \forall z=(x,y)\in E. \end{aligned}$$
(2.2)

By Dynkin’s formula, Grönwall’s inequality and possibly a localization argument, (2.2) leads to

$$\begin{aligned} \mathbb {E}^zV(Z_t)\le e^{-\bar{\gamma } t}V(z)+K/\bar{\gamma }. \end{aligned}$$
(2.3)
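In more detail (a sketch of the standard comparison argument, in the notation of (2.2)): write \(u(t)=\mathbb {E}^zV(Z_t)\), so that Dynkin’s formula and (2.2) give

$$\begin{aligned} u'(t)=\mathbb {E}^z\mathcal {L}V(Z_t)\le -\bar{\gamma } u(t)+K, \end{aligned}$$

and Grönwall’s inequality then yields \(u(t)\le e^{-\bar{\gamma } t}u(0)+\frac{K}{\bar{\gamma }}\big (1-e^{-\bar{\gamma } t}\big )\le e^{-\bar{\gamma } t}V(z)+K/\bar{\gamma }\), which is (2.3).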

If in addition \(V(z)\ge |x|^m\) for all sufficiently large |x|, the m-th moment of \(|X_t|\) is bounded under each invariant measure. Note also that by replacing V with \(V+1\), (2.2) holds with a different K. So without loss of generality we can assume \(V\ge 1\).

To continue our discussion, note that the infinitesimal generator of a diffusion with random switching is given by [7]

$$\begin{aligned} \mathcal {L}f(x,y)&=b(x,y)\cdot \nabla _x f(x,y)+\frac{1}{2}\text {tr}[\sigma ^t(x,y) \nabla ^2_x f(x,y)\sigma (x,y)] \nonumber \\&\quad +\sum _{y'\in F}\lambda (x,y,y') (f(x,y')-f(x,y)). \end{aligned}$$
(2.4)

One naive choice of Lyapunov function that leads to a moment bound is simply \(V(x,y)=|x|^m\). Assumption 2.1 indicates \(\mathcal {L}V(z)\le -m(\gamma (y)-\epsilon ') V(z)+K'\) for some positive constants \(\epsilon '\) and \(K'\), so with this choice of V, (2.2) holds if \(\gamma (y)>0\) for all y. But this is too restrictive. The limitation of this naive choice comes from its ignorance of the Y component, so the averaging effect from the transitions is missing. A natural way to incorporate the information of Y is to consider the following monomial or polynomial form

$$\begin{aligned} V(z)=a(y)|x|^m,\quad \text {or}\quad V(z)=\sum _{i\le I} a_i(y)|x|^{m_i}. \end{aligned}$$
(2.5)

The coefficients \(a_i(y)\) are strictly positive numbers that represent the potential dissipation, whose meaning will be discussed in Remark 2.3. By incorporating this information of \(Y_t\), V captures the global dissipation even though \(\mathcal {L}\) is a local operator.

We will adopt three simplified notations in the following exposition. First, for Lyapunov functions we only need to be concerned with large |x|, and the constant terms are usually negligible; the precise statement is given by Lemma 6.2. So we write \(f(z)\lesssim g(z)\) if there is a constant K such that \(f(z)\le g(z)+K\). Second, we often identify a function \(a: F\mapsto \mathcal {R}\) with a vector in \(\mathcal {R}^{|F|}\), with the y-th coordinate being \([a]_y=a(y)\). We also use \(\Lambda (x)\) for the transition rate matrix on F, with entries \([\Lambda (x)]_{y,y'}=\mathbbm {1}_{y'\ne y}\lambda (x,y,y')-\mathbbm {1}_{y=y'}\bar{\lambda }(x,y)\). With this notation, we can separate the first and second lines of (2.4) into the form

$$\begin{aligned} \mathcal {L}f(x,y)=\mathcal {L}_X(y) f(\,\cdot \, ,y)+[\Lambda (x)f(x,\,\cdot \,)]_y, \end{aligned}$$
(2.6)

where \(\mathcal {L}_X\) and \(\Lambda (x)\) represent the dynamics of the diffusion part and the transition part, respectively. Third, \(|x|^m\) is not \(\mathcal {C}^2\) when \(m<2\) and not well defined when \(m<0\), so rigorously speaking \(\mathcal {L}\) cannot be applied to it. However, as proposed in [3], \(f_m(x)=|x|^{m+n}/(1+|x|^n)\) with sufficiently large n is \(\mathcal {C}^2\) and carries essentially the same dissipation property as \(|x|^m\); that is, for any \(\delta >0\)

$$\begin{aligned} \mathcal {L}_X(y) f_m(x)\lesssim (-m\gamma (y)+\delta )f_m(x), \end{aligned}$$

see Lemma 6.3. So without loss of generality, we assume all \(|x|^m\) are well defined and \(\mathcal {C}^2\); otherwise we use \(f_m\) in its place.
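As a quick numerical illustration of this regularization (the values of m and n below are made up, chosen so that \(m+n\ge 2\)):

```python
import numpy as np

m, n = 0.5, 4                  # illustrative orders with m + n >= 2
f = lambda x: np.abs(x) ** (m + n) / (1.0 + np.abs(x) ** n)

print(f(100.0) / 100.0 ** m)   # ratio f_m(x) / |x|^m is close to 1 for large |x|
print(f(0.0))                  # f_m vanishes like |x|^{m+n} at the origin
```

So \(f_m\) matches \(|x|^m\) to leading order at infinity while staying smooth at the origin.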

2.1 Constant transition rates

In order to build up the intuition, let us first review the classical case studied in [3, 11], where \(\Lambda (x)=\Lambda \) is a constant irreducible matrix and \(X_t\) is driven by the linear equation (1.2). Then \(Y_t\) is an ergodic Markov process on F that does not depend on \(X_t\). Let \(\pi \) be the unique ergodic measure of \(Y_t\); the dissipation on average can be formulated as

$$\begin{aligned} \sum _{y\in F} \pi (y)\gamma (y)>0. \end{aligned}$$
(2.7)

Suppose that \(\sigma \equiv 0\); then \(|X_t|^m=\exp (-m\int ^t_0\gamma (Y_s) {\text {d}}s)|X_0|^m\), and by Jensen’s inequality and the Birkhoff ergodic theorem, we see that (2.7) is necessary for the whole dynamics to be dissipative. In fact, it is also sufficient, due to the following theorem, which is a translation of Theorem 1.5 of [3] into our context.

Theorem 2.2

Suppose that \(X_t\) follows (1.2) and \(Y_t\) is an ergodic Markov jump process with constant transition rate matrix \(\Lambda \). Suppose also that the average dissipation is positive, \(\sum \pi (y)\gamma (y)>0\), with \(\pi \) being the ergodic measure of \(Y_t\). Let \(\Gamma \) be the diagonal matrix with \(\gamma (y)\) as its (y, y) entry. Then there is an \(m>0\) such that the spectrum of \(-m\Gamma +\Lambda \) lies in the negative half plane, and \(V(z)=a(y)|x|^m\) is a Lyapunov function, where a as a vector is the Perron eigenvector of \(-m\Gamma +\Lambda \).

Proof

According to Assumption 2.1 and Lemma 6.3, for \(V(z)=a(y)|x|^m\) and any fixed \(\delta >0\),

$$\begin{aligned} \mathcal {L} V(z)&=a(y)\mathcal {L}_X(y) |x|^m+|x|^m [\Lambda a]_y\nonumber \\&\lesssim [-m\gamma (y)a(y)+\delta +[\Lambda a]_y] |x|^m\nonumber \\&=\left( [(-m\Gamma +\Lambda )a]_y +\delta \right) |x|^m. \end{aligned}$$
(2.8)

Based on (2.8), if a is a right eigenvector of the matrix \(-m\Gamma +\Lambda \) associated with a negative eigenvalue, and a is strictly positive componentwise, then V(z) is a Lyapunov function in the sense of (2.2). Such an a can be found by the following two observations from [3].

First, from the Feynman–Kac formula, we find that the \((y,y')\)-th entry of the matrix \(\exp (-m\Gamma t+\Lambda t)\) is

$$\begin{aligned} \mathbb {E}^{y}\mathbbm {1}_{Y_t=y'}\exp \left( -\int ^t_0 m\gamma (Y_s){\text {d}}s\right) >0, \end{aligned}$$
(2.9)

so the Perron–Frobenius theorem applies to \(\exp (-m\Gamma t+\Lambda t)\). As a consequence, if a is the Perron eigenvector, that is, the eigenvector associated with the eigenvalue of maximum real part, then a is strictly positive componentwise. Since the spectrum of \(\exp (-m\Gamma t+\Lambda t)\) and the spectrum of \(-m\Gamma +\Lambda \) clearly have a one-to-one correspondence, a is also the eigenvector of \(-m\Gamma +\Lambda \) associated with the eigenvalue of maximum real part.

Second, at \(m=0\), the Perron eigenvalue is 0. Through a perturbation analysis in m toward the positive direction, one can show that the spectrum of \(-m\Gamma +\Lambda \) lies in the negative half plane for small enough m. The details of these results can be found in Proposition 4.2 of [3].

Combining these two arguments, we find a strictly positive m such that the spectrum of \(-m\Gamma +\Lambda \) is in the negative half plane, and the Perron eigenvector a of \(\exp (-m\Gamma t+\Lambda t)\) is an eigenvector of \(-m\Gamma +\Lambda \) associated with a negative eigenvalue, with all components of a strictly positive.\(\square \)

Remark 2.3

Let \(\mathbbm {1}\) be the vector with one in each component. Since a is the Perron eigenvector, we can approximate a by normalizing \(\exp (-m\Gamma t+\Lambda t)\mathbbm {1}\) as \(t\rightarrow \infty \). Because of the Feynman–Kac formulation (2.9),

$$\begin{aligned} \frac{a(y)}{a(y')}=\lim _{t\rightarrow \infty }\frac{\mathbb {E}^y \exp \left( -\int ^t_0 m \gamma (Y_s){\text {d}}s \right) }{\mathbb {E}^{y'} \exp \left( -\int ^t_0 m \gamma (Y_s){\text {d}}s \right) }. \end{aligned}$$
(2.10)

In other words, a(y) measures the potential dissipation along the whole future. This explains why V(z) captures the global dissipation: if \(\gamma \) is negative for a certain y, then y produces a weaker potential dissipation compared with the average state, so the transition part in (2.8), \([\Lambda (x) a]_y=\sum _{y'} \lambda (x,y,y')(a(y')-a(y))\), can be negative and compensate the inflation in \(m\gamma (y)\).
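As a numerical sanity check of Theorem 2.2, one can compute the spectrum and Perron eigenvector of \(-m\Gamma +\Lambda \) directly. The two-regime example below (one inflating regime, one dissipating regime, with dissipation on average) is hypothetical, not from the paper:

```python
import numpy as np

# toy two-regime example: regime 0 inflates (gamma < 0), regime 1 dissipates
gamma = np.array([-1.0, 2.0])
Lam = np.array([[-3.0, 3.0],          # transition rate matrix, rows sum to 0
                [ 1.0, -1.0]])
pi = np.array([1.0, 3.0]) / 4.0       # stationary distribution: pi @ Lam = 0
assert np.allclose(pi @ Lam, 0.0) and pi @ gamma > 0  # dissipation on average

m = 1.0
A = -m * np.diag(gamma) + Lam
w, V = np.linalg.eig(A)
k = np.argmax(w.real)                 # Perron eigenvalue (max real part)
a = V[:, k].real
a = a / a[np.argmax(np.abs(a))]       # normalize so the largest entry is +1
print(w.real.max(), a)                # negative spectral abscissa, positive a
```

With these made-up rates, the spectral abscissa of \(-\Gamma +\Lambda \) is negative and the Perron eigenvector is componentwise positive, so \(a(y)|x|\) is a Lyapunov function for this toy system.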

2.2 Multiscale transitions: one fast scale

When the transition rates are coupled with the diffusion part, so are the trajectories of \(Y_t\) and \(X_t\). This makes the notion of dissipation on average no longer as simple as (2.7). One way to manifest it is to find multiscale structures in the transition rates controlled by \(X_t\). Such structures arise naturally in many physical models, as part of \(X_t\) represents the temperature of the environment [48, 49] or the electronic potential [6] and controls the speed of moisture convection or chemical reactions. For simplicity of discussion, we assume in the following sections that the transition rate matrix has polynomial dependence on |x|:

$$\begin{aligned} \Lambda (x)=\sum _{0\le j\le J}\Lambda _j |x|^{n_j}, \end{aligned}$$
(2.11)

where the \(\Lambda _j\) are constant \(|F|\times |F|\) transition matrices and \(0\le n_0< \cdots <n_J\). We remark that it is possible to generalize our methods below to cases where \(\Lambda (x)\) is a sum of other functions of separate orders. While the discussion below may appear abstract on a first read, the ideas are rather elementary. A concrete example is illustrated by Fig. 1 in Sect. 2.4, and readers may consult that section first for intuition.

Fig. 1 Averaging multiscale transition rates. Subplot (1) is the original system, and (2) and (3) are the systems after one and two averaging steps. The leading-order transitions are marked by solid arrows, and the lower-order transitions are marked by dashed arrows

Since Lyapunov functions concern only large |x| (see Lemma 6.2), it is intuitive that the highest order transition \(\Lambda _J\) plays the dominating role: if the average of \(\gamma \) over the invariant measure of \(\Lambda _J\) is positive, then there should be a Lyapunov function that quantifies dissipation on average. The complications to this argument may come from two aspects: (1) the support of each \(\Lambda _j\) may not be the whole state space F, so different subsets of F may have different transition scales; (2) on its support, \(\Lambda _j\) may not induce an irreducible Markov chain. Here \(F'\subset F\) is the support of a transition rate matrix \(\Lambda \) if \(F'\) is the minimal subset such that \(\lambda (y,y')=0\) whenever y and \(y'\) are not both in \(F'\). We will leave the first issue to the next subsection and focus first on the averaging phenomenon from multiscale transitions and possibly reducible structures.

Irreducibility was a necessary condition in Theorem 2.2, where the transition matrix was constant, but it is no longer necessary if the transitions are genuinely fast. Consider the following simple example on two states

$$\begin{aligned} F=\{-2,1\}, \quad \gamma (y)=y, \quad \lambda (-2,1)=1, \quad \lambda (1,-2)=0, \quad \hbox {d}X_t=-\gamma (Y_t)X_t \hbox {d}t. \end{aligned}$$

Clearly \(\delta _{1}\) is the invariant measure for \(Y_t\) and there is dissipation on average over this measure. However, \(\mathbb {E}^{x,y} X_t=x\mathbb {E}^y\exp (-\int ^t_0 Y_s{\text {d}}s)\), and if we start from \(y=-2\),

$$\begin{aligned} \mathbb {E}^y\exp \left( -\int ^t_0 Y_s{\text {d}}s\right) \ge \exp (2t)\mathbb {P}^y(Y_s=-2,s\le t)=\exp (t), \end{aligned}$$

so \(\mathbb {E}^{x,y} |X_t|\) diverges to infinity. On the other hand, if we replace the transition rates by \(\lambda (-2,1)=\lambda (1,-2)=|x|+1\), the time that \(Y_t\) spends in \(-2\) becomes much shorter as \(|X_t|\) gets large, so the dynamics is dissipative on average.
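The divergence in this example can also be verified exactly: by the Feynman–Kac formulation (2.9), \(\mathbb {E}^y\exp (-\int ^t_0 Y_s{\text {d}}s)=[\exp ((-\Gamma +\Lambda )t)\mathbbm {1}]_y\), which a short script can evaluate (ordering the states as \((-2, 1)\)):

```python
import numpy as np
from scipy.linalg import expm

Gamma = np.diag([-2.0, 1.0])          # gamma(y) = y on F = {-2, 1}
Lam = np.array([[-1.0, 1.0],          # lambda(-2 -> 1) = 1
                [ 0.0, 0.0]])         # lambda(1 -> -2) = 0
t = 5.0
f = expm((-Gamma + Lam) * t) @ np.ones(2)   # f[y] = E^y exp(-int_0^t Y_s ds)
# starting from y = -2 the moment blows up at least like e^t,
# while from the absorbing state y = 1 it decays like e^{-t}
print(f)
```

Solving the linear ODE by hand gives \(f_{-2}(t)=\tfrac{3}{2}e^t-\tfrac{1}{2}e^{-t}\) and \(f_1(t)=e^{-t}\), consistent with the lower bound \(e^t\) displayed above.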

To continue our discussion, we need the notion of connected components. Given a transition rate matrix \(\Lambda (x)\), we say \(F'\subset F\) is an order n maximal connected component of F if the following hold

  1.

    There is a constant \(|F'|\times |F'|\) matrix \(\Lambda _{F'}\) such that \(\Lambda (x)|_{F'}-|x|^n \Lambda _{F'}\) is of order \(|x|^{n-\delta }\) for some \(\delta >0\). Here \(\Lambda (x)|_{F'}\) is the principal submatrix of \(\Lambda (x)\) with indices in \(F'\).

  2.

    For any \(y,y'\in F'\), there is a path \(y=y_0,y_1,\ldots , y_m=y'\) such that for each i, either \(\lambda _{F'}(y_i, y_{i+1})>0\) or \(\lambda _{F'}(y_{i+1},y_i)>0\).

  3.

    \(F'\) is maximal as there is no strict superset of \(F'\) that also satisfies the conditions above.

When such \(F'\) and \(\Lambda _{F'}\) exist, because \(\Lambda _{F'}|x|^n\) consists of the leading-order terms of the transition rate matrix \(\Lambda (x)\) restricted to \(F'\), \(\Lambda _{F'}\) itself must also be a transition rate matrix. As a consequence, there is a Markov jump process with constant rate matrix \(\Lambda _{F'}\) on \(F'\). This is a mechanism we will exploit in our discussion later.

Next we define the irreducible components, which are also called closed communicating classes in the literature, e.g., [42]. A subset \(G\subset F'\) is an irreducible component if (1) for all \(y,y'\in G\), there is a path \(y=y_0,\ldots , y_n=y'\) in G such that \(\lambda _{F'}(y_i,y_{i+1})>0\); (2) for all \(y\in G,y'\notin G\), such a path does not exist. We will use \(G^c\) to denote the transient set, which consists of the states in \(F'\) not belonging to any irreducible component.

Now we consider the simple case where the highest order component is F itself.

Theorem 2.4

Suppose the whole state space F is a maximal connected component of order \(n>0\). Let \(\{G_k\}\) be the irreducible components, and \(\pi _k\) be the ergodic measure generated by \(\Lambda _F|_{G_k}\). Let \(\gamma \) be the linear dissipation rate function satisfying Assumption 2.1, and suppose \(\sum _{y\in G_k} \pi _k(y)\gamma (y)>0\) for each \(G_k\). Then for any \(m>0\) there is a Lyapunov function of the form

$$\begin{aligned} V(x,y)=|x|^m+a(y)|x|^{m-n}. \end{aligned}$$

Proof

Let \(V_0(z)=|x|^m\). Applying the generator directly, by Lemma 6.3, for any \(\delta >0\)

$$\begin{aligned} \mathcal {L}V_0(z)\lesssim (-m\gamma (y)+\delta )|x|^m. \end{aligned}$$

The right-hand side is a polynomial of order m, and for any irreducible component \(G_k\), since \(\sum _{y\in G_k} \pi _k(y)\gamma (y)>0\),

$$\begin{aligned} \sum _{y\in G_k}\pi _k(y)(-m\gamma (y)+\delta )|x|^m\lesssim -2\delta |x|^m, \end{aligned}$$

if \(\delta \) is sufficiently small. Applying a Fredholm alternative type of argument, namely Lemma 2.5 directly below, there is a monomial \(Q(z)=a(y)|x|^{m-n}\) with \(a(y)\ge 0\), such that

$$\begin{aligned} {[}\Lambda (x)Q(z)]_y+(-m\gamma (y)+\delta )|x|^m\lesssim -\delta |x|^m. \end{aligned}$$

Then, because \(V(z)=V_0(z)+Q(z)\) is of order m and \(\mathcal {L}_X(y) Q(z)\) is of order \(m-n\) by Lemma 6.3, we have \(\mathcal {L}V(z)\lesssim -\delta V(z)\), so V(z) is a Lyapunov function. \(\square \)

Lemma 2.5

Let \(F'\) be a maximal connected component of order \(n>0\), and let P(x, y) be a polynomial in |x| of order \(m>0\) such that \(P(x,y)=0\) if \(y\notin F'\). Suppose the following holds:

$$\begin{aligned} \sum _{y\in G_k} \pi _k (y) P(x,y)\lesssim 0,\quad \forall x\in \mathcal {R}^d. \end{aligned}$$

Then we can find a positive monomial \(Q(z)=q(y)|x|^{m-n}\) with \(q(y)\ge 0\) and \(q(y)=0\) if \(y\in F/F'\), such that for any \(\epsilon >0\)

$$\begin{aligned} \Lambda (x) Q(z)+P(z)\lesssim \epsilon |x|^m,\quad y\in F'. \end{aligned}$$

Proof

First, we will specify the value of q(y) for each nontransient y; we assume \(y\in G_k\) in the discussion below. Let \(p(y)|x|^m\) be the maximum order term in P(z); then clearly \(\sum _{y'\in G_k}\pi _k (y') p(y')\le 0\). By the Fredholm alternative, there is a vector \(q_k\) with nonzero components only for indices in \(G_k\), so that

$$\begin{aligned} p|_{G_k}+\Lambda _{F'} q_k =\left( \sum _{y'\in G_k} \pi _k(y')p(y')\right) \mathbbm {1}_{G_k}. \end{aligned}$$
(2.12)

Here \(\mathbbm {1}_{G_k}\) stands for the indicator vector of the set \(G_k\). Since \(\Lambda _{G_k}\mathbbm {1}_{G_k}=0\), we can always replace \(q_k\) with \(q_k+\kappa \mathbbm {1}_{G_k}\) for a proper \(\kappa \), so that it is still a solution to (2.12) but with strictly positive components. Hence we can assume \(q_k>0\) componentwise. We let \(q(y)=q_k(y)\) for nontransient y.

Next, for the transient states \(y\in G^c\), consider a Markov jump process \(Y'_t\) on \(F'\) driven by the transition matrix \(\Lambda _{F'}\). Denote by T(y) the expected time to hit any one of the \(G_k\) starting from y. Clearly \(T(y)=0\) for \(y\in G_k\). By a one-step analysis of the jumps, we find for \(y\in G^c\),

$$\begin{aligned} T(y)=\bar{\lambda }^{-1}_{F'} (y)+\sum _{y'}\frac{\lambda _{F'}(y,y')}{\bar{\lambda }_{F'}(y)}T(y'), \end{aligned}$$

where we recall that \(\bar{\lambda }_{F'}(y)=\sum _{y'\ne y} \lambda _{F'}(y,y')\). As a consequence, for \(y\in G^c\),

$$\begin{aligned} \Lambda _{F'} T(y)=\sum _{y'\in F'}\lambda _{F'}(y,y')(T(y')-T(y))=-1. \end{aligned}$$

We will let \(q(y)=\beta T(y)\) with

$$\begin{aligned} \beta > \max _{y\in F'}|p(y)|+(\max _{y\in F'}\bar{\lambda }_{F'}(y))(\max _{y\in G_k,\forall k}q(y)). \end{aligned}$$

Now we verify our claim. It suffices to show that the order m term of \(\Lambda (x) Q(z)+P(z)\) is negative. Since \((\Lambda (x)-\Lambda _{F'}|x|^n)Q(z)\) is of order strictly less than m, it suffices to show

$$\begin{aligned} {[}\Lambda _{F'}q]_y+p(y) \le 0, \quad \forall y\in F'. \end{aligned}$$

This is clearly the case for \(y\in G_k\), as it is implied by (2.12). For \(y\in G^c\), it holds because

$$\begin{aligned} {[}\Lambda _{F'}q]_y+p(y)&=\sum _{y'\in F'}\lambda _{F'}(y,y')(q(y')-q(y))+p(y)\\&=\sum _{y'\in G^c}\lambda _{F'}(y,y')\beta (T(y')-T(y))\\&\quad \,+\sum _k\sum _{y'\in G_k}\lambda _{F'}(y,y')(q(y')- \beta T(y))+p(y)\\&\le \beta \Lambda _{F'}T(y)+\bar{\lambda }_{F'}(y)\max _{y'\in G_k,\forall k}q(y')+p(y)< 0. \end{aligned}$$

\(\square \)
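The hitting times T(y) used in the proof solve the linear system \([\Lambda _{F'}T]_y=-1\) on the transient states, with \(T=0\) on the irreducible components, so they can be computed by a single linear solve. A minimal sketch on a hypothetical four-state chain (states 0 and 1 forming two absorbing irreducible components, states 2 and 3 transient; the rates are made up):

```python
import numpy as np

# rate matrix Lambda_{F'}: G_1 = {0}, G_2 = {1} absorbing, G^c = {2, 3}
L = np.array([[ 0.0, 0.0,  0.0,  0.0],
              [ 0.0, 0.0,  0.0,  0.0],
              [ 1.0, 0.0, -2.0,  1.0],   # 2 -> 0 at rate 1, 2 -> 3 at rate 1
              [ 0.0, 2.0,  0.0, -2.0]])  # 3 -> 1 at rate 2
trans = [2, 3]
# solve [Lambda T]_y = -1 for transient y, with T = 0 on absorbing states
T = np.zeros(4)
T[trans] = np.linalg.solve(L[np.ix_(trans, trans)], -np.ones(len(trans)))
print(T)   # one-step check: T(3) = 1/2, T(2) = 1/2 + (1/2) T(3) = 3/4
```

The printed values agree with the one-step analysis in the proof: leaving state 3 takes mean time 1/2, and state 2 waits a mean time 1/2 and then hits \(G_1\) directly or passes through state 3 with equal probability.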

2.3 Multiscale transitions: multiple scaling structures

When F is not the maximal connected component, the transitions inside a maximal connected component \(F'\) of highest order will be significantly faster than the transitions outside. These fast transitions will average the dissipation of each irreducible component \(G_k\) inside \(F'\), and also the manner in which \(Y_t\) leaves \(F'\). From the perspective of the states outside \(F'\), each \(G_k\) is essentially a single point, and its dissipation rate is the averaged dissipation over \(\pi _k\); each transient state \(y\in G^c\) is an intermediate state that can jump to any of the \(G_k\), while the time \(Y_t\) spends on it is negligible, as long as the rates toward nonirreducible parts, \(\lambda (x,y,y')\) with \(y'\in G^c\cup F/F'\), are not too strong.

To be more specific, given a transition dissipation pair \((\Lambda (x),\gamma )\) with \(\Lambda (x)\) of order \(|x|^n\), let \(F'\) be a maximal connected component, \(G_1,\ldots , G_K\) be the irreducible sets, and \(G^c\) be the transient set. We define a new structure \((\widetilde{\Lambda }(x), \tilde{\gamma })\) on the averaged space \(\widetilde{F}=(F/F')\cup \{g_1,\ldots , g_K\}\) as the average of the original structure on \(F'\). Intuitively, the new rates are the same for states outside \(F'\),

$$\begin{aligned} \tilde{\gamma }(y)=\gamma (y),\quad \tilde{\lambda }(x,y,y')=\lambda (x,y,y'),\quad y,y' \in F/F'. \end{aligned}$$

The rates related to \(g_k\) are given by the following averages:

$$\begin{aligned} \tilde{\gamma }(g_k)= & {} \sum _{y\in G_k} \pi _k(y)\gamma (y),\quad \tilde{\lambda }(x,g_k,y)=\sum _{y'\in G_k} \pi _k(y')\lambda (x,y',y),\,\,\,\, y\in F/F',\nonumber \\ \tilde{\lambda }(x, y,g_k)= & {} \sum _{y'\in G_k}\lambda (x,y,y')+\sum _{y'\in G^c} \lambda (x,y,y') p_{F'} (y, g_k),\quad \quad \,\,\, y\in F/F', \\ \tilde{\lambda }(x, g_j,g_k)= & {} \sum _{y\in G_j}\pi _j(y)\sum _{y'\in G_k}\lambda (x,y,y')+\sum _{y\in G_j}\pi _j(y)\sum _{y'\in G^c} \lambda (x,y,y') p_{F'} (y', g_k).\nonumber \end{aligned}$$
(2.13)

In the above, \(\pi _k\) is the ergodic measure on \(G_k\) induced by the matrix \(\Lambda _{F'}\), and \(p_{F'} (y, g_k)\) is the probability that \(Y'_t\) ends up in \(G_k\), where \(Y'_t\) is a Markov chain driven by \(\Lambda _{F'}\) starting from y.
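The absorption probabilities \(p_{F'}(y,g_k)\) can likewise be obtained from a linear solve: \(p_{F'}(\,\cdot \,,g_k)\) is harmonic for \(\Lambda _{F'}\) on the transient states, equal to 1 on \(G_k\) and 0 on the other irreducible components. A minimal sketch on a hypothetical four-state chain (states 0 and 1 absorbing, states 2 and 3 transient; the rates are made up):

```python
import numpy as np

L = np.array([[ 0.0, 0.0,  0.0,  0.0],    # G_1 = {0}, G_2 = {1} absorbing
              [ 0.0, 0.0,  0.0,  0.0],
              [ 1.0, 0.0, -2.0,  1.0],    # transient state 2
              [ 0.0, 2.0,  0.0, -2.0]])   # transient state 3
trans, G1 = [2, 3], [0]
# harmonic system: [L p]_y = 0 on transients, p = 1 on G_1, p = 0 on G_2
rhs = -L[np.ix_(trans, G1)] @ np.ones(len(G1))
p = np.linalg.solve(L[np.ix_(trans, trans)], rhs)
print(p)   # p(2, g_1) and p(3, g_1)
```

Here state 3 can only reach \(G_2\), so \(p_{F'}(3,g_1)=0\), while state 2 jumps directly to \(G_1\) or to state 3 with equal rates, giving \(p_{F'}(2,g_1)=1/2\).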

Note that in these averaging procedures, the transition rates from \(y\in G^c\) to \(y'\in G^c\cup F/F'\) are completely wiped out. So we need these rates to be not too strong, otherwise the averaged structure cannot represent this information. In particular, we have the following nondominating condition.

Assumption 2.6

For any transient state y and any \(y'\in G^c\cup F/F'\), suppose that \(p_{F'} (y,g_k)>0\), then there is a \(y''\in G_k\) such that \(\lambda (x,y,y')\) has at most the same polynomial order in |x| as \(\lambda (x,y'',y')\).

Since there are only finitely many states, there are only finitely many, say \(m_J\), connected components with the highest order \(n_J\) in (2.11). After applying an averaging step to one of these components, \(F'\), the transition rates related to \(F'\) are of order strictly less than \(n_J\), and the state space after averaging has a smaller cardinality \(|\widetilde{F}|\le |F|\). So after \(m_J\) steps, the transition rates are of order at most \(n_{J-1}\). We can repeat this argument J times and finally end up with an averaged transition matrix \(\widetilde{\Lambda }\) that is a constant matrix. Intuitively, this constant matrix dictates whether the original system is dissipative on average.

Theorem 2.7

Let the state space \(\widetilde{F}\), constant transition rates \(\widetilde{\Lambda }\) and dissipation rates \(\tilde{\gamma }\) be the final result of a sequence of averaging procedures. Suppose at each averaging step, the transient transition rates satisfy the nondominating condition, Assumption 2.6. Then the original system has a polynomial-like Lyapunov function of some order \(m>0\) if \(\widetilde{F}\) consists only of irreducible components of \(\widetilde{\Lambda }\), while on each of them the average dissipation of \(\tilde{\gamma }\) is positive. If in addition \(\tilde{\gamma }(y)>0\) for all \(y\in \widetilde{F}\), then m can be any positive number.

Theorem 2.4 is the special case of the theorem above with one averaging step, and the conditions there are not optimal; we keep Theorem 2.4 for its simpler intuition.

Proof of Theorem 2.7

Based on Theorem 2.2, it is clear how to find a Lyapunov function \(\widetilde{V}(z)\) for the final averaged dynamics \((\widetilde{\Lambda },\tilde{\gamma })\). In particular, if \(\tilde{\gamma }(y)>0\) for all y, then for any \(\delta >0\) and \(m>0\),

$$\begin{aligned} \mathcal {L}|x|^m\lesssim \big (-m\min _y\{\tilde{\gamma }(y)\}+\delta \big )|x|^m, \end{aligned}$$

so \(|x|^m\) is a Lyapunov function. Then by the induction principle, it suffices to show that, given an averaging step

$$\begin{aligned} (F, \Lambda , \gamma )\Rightarrow \big (F/F'\cup \{g_1,\ldots ,g_K\}, \widetilde{\Lambda },\tilde{\gamma }\big ) \end{aligned}$$

and a polynomial-like Lyapunov function \(\widetilde{V}(z)=\sum \tilde{a}_i(y) |x|^{n_i}\) for the averaged structure, how to construct a new polynomial Lyapunov function V(z).

One point that requires special attention is the order of the Lyapunov function \(\widetilde{V}\) and of the detailed transitions, defined as

$$\begin{aligned} \tilde{\lambda }(x,y,y')(\widetilde{V}(x,y')-\widetilde{V}(x,y)). \end{aligned}$$

These polynomials are clearly of order at most m in the final state, since \(\tilde{\lambda }\) are constants and \(\widetilde{V}\) is of order m. We will show this polynomial order is inherited by the constructed Lyapunov function V(z) of the pre-averaged dynamics.

Denote the maximal order term in \(\widetilde{V}(z)\) as \(\tilde{a}_m(y)|x|^m\). Because \(\widetilde{V}\) is a Lyapunov function, there is a \(\gamma _0>0\) such that the order m terms in \(\widetilde{\mathcal {L}} \widetilde{V}(z)\) are

$$\begin{aligned}{}[\widetilde{\Lambda }(x) \widetilde{V}(x)]_{y}-m\tilde{\gamma }(y)\tilde{a}_m(y)|x|^m\lesssim -\gamma _0 \widetilde{V}(z),\quad y\in F/F'\cup \{g_1,\ldots ,g_K\}, \end{aligned}$$
(2.14)

because \(\mathcal {L}_X |x|^{m-\delta }\) is of order less than m based on Lemma 6.3. To continue, we notice the dual of the averaging step produces the following function on \(\mathcal {R}^d\times F\) based on \(\widetilde{V}\):

$$\begin{aligned} V_0(x,y)={\left\{ \begin{array}{ll}\widetilde{V}(x,y),\quad &{}y\in F/F';\\ \widetilde{V}(x,g_k),\quad &{}y\in G_k;\\ \sum _k p_{F'}(y,g_k)\widetilde{V}(x,g_k), &{}y\in G^c. \end{array}\right. } \end{aligned}$$

Decompose the transition rates into two parts \(\Lambda (x)=\Lambda _{F'}|x|^n+\Lambda ^c(x)\), where \(\Lambda ^c\) is of order \(n-\delta \) for some \(\delta >0\). With some technical verification in Lemma 6.4, the following duality equations hold

$$\begin{aligned} {\left\{ \begin{array}{ll} {[}\widetilde{\Lambda }(x)\widetilde{V}(x,\,\cdot \,)]_y\,=[\Lambda (x) V_0(x,\,\cdot \,)]_y,\quad y\in F/F';\\ {[}\widetilde{\Lambda }(x)\widetilde{V}(x,\,\cdot \,)]_{g_k}=\sum _{y\in G_k}\pi _k(y)[\Lambda (x) V_0(x,\,\cdot \,)]_y. \end{array}\right. } \end{aligned}$$
(2.15)

We claim that the detailed transitions of \(V_0\) induced by \(\Lambda (x)\) are of order at most m. The detailed transition of \(V_0\) from a \(y\in G_k\) to a \(y'\in F/F'\) is \(\lambda (x,y,y')(\widetilde{V}(x,g_k)-\widetilde{V}(x,y'))\). Its convex combination over \(y\in G_k\) is

$$\begin{aligned} \tilde{\lambda }(x,g_k,y') (\widetilde{V}(x,g_k)-\widetilde{V}(x,y'))=\left( \sum _{y\in G_k} \pi _k(y)\lambda (x,y,y')\right) (\widetilde{V}(x,g_k)-\widetilde{V}(x,y')). \end{aligned}$$

The left-hand side is of order at most m by the induction hypothesis; on the right-hand side, the coefficients satisfy \(\lambda (x,y,y')\ge 0, \pi _k(y)>0\). So \(\lambda (x,y,y')(\widetilde{V}(x,g_k)-\widetilde{V}(x,y'))\) is of order at most m. Likewise, because

$$\begin{aligned}&\tilde{\lambda }(x,g_k,g_j)(\widetilde{V}(x,g_j)-\widetilde{V}(x,g_k))\\&\quad =\sum _{y\in G_k} \pi _k(y)\left[ \sum _{y'\in G^c}\lambda (x,y,y')\sum _j p_{F'}(y',g_j)+\sum _{y'\in G_j}\lambda (x,y,y')\right] (\widetilde{V}(x,g_j)-\widetilde{V}(x,g_k)), \end{aligned}$$

we can conclude that the detailed transition of \(V_0\) from \(y\in G_k\) to \(y'\in G_j\) or \(G^c\) is of order at most m. By the same argument, the detailed transitions of \(V_0\) from \(y\in F/F'\) to any other \(y'\) are all of order at most m. For \(y\in G^c\), the detailed transition to any \(y'\) is

$$\begin{aligned} \lambda (x,y,y')(V_0(x,y')-V_0(x,y))=\lambda (x,y,y') \sum _{k}p_{F'}(y,g_k)(V_0(x,y')-V_0(x,y_k)), \end{aligned}$$

where \(y_k\) is any element in \(G_k\). Then, by the nondominating condition, Assumption 2.6, and the fact that \(\lambda (x,y_k,y')(V_0(x,y')-V_0(x,y_k))\) is of order at most m, so is \(\lambda (x,y,y')(V_0(x,y')-V_0(x,y))\).

If we let \(V(z)=V_0(x,y)\), the Lyapunov dissipation will be inherited for \(y\in F/F'\), but there will be an order m error term for \(y\in F'\), and we will apply Lemma 2.5 to fix this with a monomial. In particular, the image of \(V_0\) through \(\mathcal {L}\) is

$$\begin{aligned} \mathcal {L}V_0(z)\lesssim P(x,y):=-m\gamma (y)\tilde{a}_m(y)|x|^m+[\Lambda (x)V_0(x)]_y. \end{aligned}$$

The average of P(x, y) over any \(G_k\), by the second duality equation, is

$$\begin{aligned} \sum _{y\in G_k}\pi _k(y)([\Lambda (x) V_0(x,\,\cdot \,)]_y-m\gamma (y)\tilde{a}_{m}(g_k)|x|^m)=[\widetilde{\Lambda }(x) \widetilde{V}(x)]_{g_k}-m\tilde{\gamma }(g_k)\tilde{a}_{m}(g_k)|x|^m, \end{aligned}$$

which is bounded by (2.14). Since P(x, y) is of order at most m, by the Fredholm alternative, Lemma 2.5, there is a positive monomial Q(z) of order less than \(m-n\), such that if we let \(V(z)=V_0(z)+Q(z)\), then for all \(y\in F'\)

$$\begin{aligned} -m\gamma (y)\tilde{a}_{m}(g_k)|x|^m+[\Lambda (x)(Q(z)+V_0(z))]_y\lesssim -\gamma _0V_0(z). \end{aligned}$$

Since the order of Q(z) is less than \(m-n\), and \(\mathcal {L}_X(y)Q(z)\) produces a term of order at most \(m-n\), we find that \(\mathcal {L}V(z)\lesssim -\gamma _0 V(z)\) if \(y\in F'\).

As for \(y\in F/F'\),

$$\begin{aligned} \mathcal {L}V(z)\lesssim -m\tilde{\gamma }(y)\tilde{a}_{m}(y)|x|^m+[\widetilde{\Lambda }(x)\widetilde{V}(x)]_y+[\Lambda (x) Q(z)]_y. \end{aligned}$$

Notice the first two parts are bounded by (2.14). \([\Lambda (x)Q(z)]_y=[\Lambda ^c(x)Q(z)]_y\) because \(F'\) is a connected component of \(\Lambda _{F'}\). Then notice Q(z) is of order at most \(m-n\), while \(\Lambda ^c(x)\) is of order strictly less than n; therefore, \([\Lambda ^c(x)Q(z)]_y\) is of order strictly less than m, so \(\mathcal {L}V(z)\lesssim -\frac{1}{2}\gamma _0 V(z)\).

Lastly, we notice each detailed transition of V(z) is the sum of the detailed transitions of \(V_0(z)\) and Q(z); the first part is of order at most m by the previous discussion, and \(\lambda (x,y,y')(Q(x,y')-Q(x,y))\) is of order at most m as well. So the detailed transitions of V(z) are of order at most m. \(\square \)

Remark 2.8

Once we finish the construction of \(V=\sum a_i(y)|x|^{m_i}\) and look back, we can see that \(a_i(y)\) captures the potential dissipation associated with the transition rates of order \(|x|^{m_I-m_i}\), within a maximal connected component of that order. For y in such a maximal connected component, the \(a_j(y)\) are of identical value for \(j\ge i\). In other words, from the values of the sequence \(\{a_i(y)\}_{i\le I}\), we can actually tell which connected component of which order y belongs to. The following subsection gives a simple and concrete example.

2.4 A multiscale transition example

In this section, we consider one concrete example with multiscale transitions, where the averaging steps mentioned in the previous section can be discussed explicitly. In subplot (1) of Fig. 1, a Markov process is defined on four states \(F=\{a,b,c,d\}\) with the transition rates given along the arrows. The dissipation rates are given by

$$\begin{aligned} \gamma (a)=-1,\quad \gamma (b)=2,\quad \gamma (c)=-1,\quad \gamma (d)=-1. \end{aligned}$$

F is the maximal connected component of order 2. It has two irreducible components \(\{a,b\}\) and \(\{d\}\). The induced invariant measure on \(\{a,b\}\) is \(\pi (a)=\frac{1}{3},\pi (b)=\frac{2}{3}\). c is the only transient state. Starting from c and driven by the maximal order transition \(\Lambda _F\), it is equally likely to end up in \(\{a,b\}\) or \(\{d\}\).

After one averaging step, we have a two-state Markov chain in subplot (2). The states represent the irreducible components in the original system. The dissipation rates are given by

$$\begin{aligned} \tilde{\gamma }(ab)=\pi (a)\gamma (a)+\pi (b)\gamma (b)=1,\quad \tilde{\gamma }(d)=\gamma (d)=-1. \end{aligned}$$

The transition rates are given by

$$\begin{aligned} \tilde{\lambda }(x,ab, d)=\pi (b)\lambda (x,b,d)=|x|,\quad \tilde{\lambda }(x,d,ab)=p_{F}(c,ab)\lambda (x,d,c)=2|x|. \end{aligned}$$

So the chain is of order 1, and the invariant measure driven by the dynamics of this order is \(\tilde{\pi }(ab)=\frac{2}{3}, \tilde{\pi }(d)=\frac{1}{3}\).

With the final step of averaging, we end up with one state in (3), so the transition matrix is the constant zero matrix. The dissipation rate is given by \(\tilde{\pi }(ab)\tilde{\gamma }(ab)+\tilde{\pi }(d)\tilde{\gamma }(d)=\frac{1}{3}\). So the whole system is dissipative on average, Theorem 2.7 applies, and the Lyapunov function can be of any order.
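The arithmetic of the last averaging step can be checked numerically. Below is a minimal sketch: the generator entries are read off from the averaged rates \(\tilde{\lambda }(x,ab,d)=|x|\) and \(\tilde{\lambda }(x,d,ab)=2|x|\) above, with the common factor |x| dropped.

```python
import numpy as np

# Averaged two-state chain from subplot (2): rates lambda(ab -> d) = |x| and
# lambda(d -> ab) = 2|x|; the order-1 generator with |x| factored out is
Q = np.array([[-1.0, 1.0],    # from state "ab"
              [2.0, -2.0]])   # from state "d"

# Invariant measure: left null vector of Q, normalized to sum to 1.
w, v = np.linalg.eig(Q.T)
pi = np.real(v[:, np.argmin(np.abs(w))])
pi = pi / pi.sum()
print(pi)            # [2/3, 1/3]

# Average dissipation over pi with gamma(ab) = 1, gamma(d) = -1.
gamma = np.array([1.0, -1.0])
print(pi @ gamma)    # 1/3 > 0: dissipative on average
```

The positive average dissipation 1/3 is exactly the rate quoted in the text.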

In particular, for the trivial Markov process described by subplot (3), \(|x|^m\) with any \(m>0\) is a Lyapunov function. The procedure in the proof of Theorem 2.7 indicates this Lyapunov function can be pulled back into Lyapunov functions for the Markov processes described by (2) and (1):

$$\begin{aligned} |x|^m+a_2(y)|x|^{m-1}+a_1(y)|x|^{m-2}\quad \Longleftarrow \quad |x|^m+\tilde{a}_2(y)|x|^{m-1}\quad \Longleftarrow \quad |x|^m. \end{aligned}$$

where \(\tilde{a}_2(ab)=0, \tilde{a}_2(d)=\frac{2}{3}m\); \(a_2(a)=a_2(b)=0\), \(a_2(c)=\frac{1}{3} m, a_2(d)=\frac{2}{3} m\), \(a_1(a)=\frac{2}{3}m,a_1(b)=0,a_1(c)=\frac{3}{2}m,a_1(d)=0\), assuming \(m\ge 4\) so \(|x|^{m-2}\) is \(\mathcal {C}^2\).

In order to illustrate Assumption 2.6, we consider one modification of (1) in Fig. 1 that violates Assumption 2.6. Suppose there is another state e, and

$$\begin{aligned} \lambda (x,a,e)=\lambda (x,b,e)=\lambda (x,d,e)=1,\quad \lambda (x,c,e)=|x|. \end{aligned}$$

If one carries out the averaging as in Fig. 1, the strong transition from c to e will be ignored, because c is a transient state in the averaging procedure.

2.5 Comparison principle

The other way to deal with nonconstant transition rates is through a comparison principle. To be specific, suppose \((\Lambda (x), \gamma )\) is dissipative on average, which may be established by Theorem 2.2 or Theorems 2.4 and 2.7. Suppose also that in another transition dissipation pair \((\widetilde{\Lambda }(x), \tilde{\gamma })\), the dissipation is stronger in all regimes, while the regime transitions are more favorable for dissipation; then, intuitively, \((\widetilde{\Lambda }(x), \tilde{\gamma })\) should also be dissipative on average.

The only vague point in the previous argument is how to determine that the regime transitions are more favorable for dissipation. In our context, since the coefficients \(a_i(y)\) in (2.5) characterize the potential dissipation (see Remark 2.3), we would intuitively say \((\widetilde{\Lambda }(x), \tilde{\gamma })\) has more favorable dissipation than \((\Lambda (x), \gamma )\) if

$$\begin{aligned} \tilde{\gamma }(y)\ge \gamma (y),\quad (\tilde{\lambda }(x,y,y')-\lambda (x,y,y'))(a_i(y')-a_i(y))\le 0,\quad \forall i. \end{aligned}$$

In this interpretation, the comparison principle is straightforward and can be generalized to dynamics on different spaces, where the state space F can be countable.

Theorem 2.9

Let P be a mapping from F to \(F'\). Suppose the transition dissipation pair \((\Lambda '(x),\gamma ')\) on \(F'\) admits a Lyapunov function of polynomial form \(V(x,y)=\sum _{i\le I} a_i(y)|x|^{m_i}\) with \(a_i\ge 0\). Suppose \((\Lambda (x), \gamma )\) on F is more favorable for dissipation in the sense that for any \(i\le I\) and \(q\in F, y\in F'\)

$$\begin{aligned} \gamma (q)\ge \gamma (P(q)),\quad \left( \sum _{P(q')=y}\lambda (x,q,q')-\lambda '(x,P(q),y)\right) (a_i(y)-a_i(P(q)))\le 0. \end{aligned}$$
(2.16)

Then V(xP(q)) will be a Lyapunov function for \((\Lambda (x), \gamma )\).

Proof

Let \(m_I\) be the maximal polynomial order. The fact that V is a Lyapunov function for \((\Lambda '(x),\gamma ')\) implies that \(\forall y\in F'\)

$$\begin{aligned} -m_Ia_{I}(y)\gamma '(y)|x|^{m_I}+\sum _{i\le I} [\Lambda '(x) a_i]_y|x|^{m_i}\lesssim -\bar{\gamma } a_{I}(y)|x|^{m_I} \end{aligned}$$
(2.17)

for a \(\bar{\gamma }>0\). According to Assumption 2.1 and Lemma 6.3, for any \(\delta >0\) the following holds

$$\begin{aligned} \mathcal {L}V(x,P(q))\lesssim & {} (-m_I\gamma (q)+\delta )a_I(P(q))|x|^{m_I}\\&+\,\sum _{q'\in F'} \lambda (x,q,q')\sum _{i\le I}(a_i(P(q'))-a_i(P(q)))|x|^{m_i}{.} \end{aligned}$$

Note that

$$\begin{aligned} \sum _{q'\in F'} \lambda (x,q,q')(a_i(P(q'))-a_i(P(q)))&= \sum _{y\in F}\left( \sum _{P(q')=y}\lambda (x,q,q')\right) (a_i(y)-a_i(P(q)))\\&\le \sum _{y\in F}\lambda '(x,P(q),y)(a_i(y)-a_i(P(q))). \end{aligned}$$

Combining both inequalities above and comparing with (2.17), we obtain \(\mathcal {L}V\lesssim (-\bar{\gamma } a_{I}(P(q))+\delta )|x|^{m_I}\). Since \(\delta \) can be arbitrarily small, this can further be bounded by \(-\bar{\gamma }'a_I(P(q))|x|^{m_I}\), hence also by \(-\bar{\gamma }'V\), for some \(\bar{\gamma }'>0\). \(\square \)

The simplicity of this proof comes from our interpretation of dissipation on average through polynomial Lyapunov functions. On the other hand, comparison principles can also be demonstrated by coupling methods when the transition rates are bounded. Cloez and Hairer [7] and Shao [46] have shown dissipation on average in the following birth–death scenario through proofs of considerable length, although it is only a special case of Theorem 2.9.

Corollary 2.10

(Birth–death-type criterion) Suppose there is a partition of regimes \(F=F_1\cup \cdots \cup F_n\), and there is an increasing sequence of dissipation rates \(\beta _1,\ldots , \beta _n\). Suppose

$$\begin{aligned} \lambda (x,y,y')=0\quad \text {if}\quad y,y'\text { are not in neighboring } F_i, \end{aligned}$$

and \(\gamma (y)\ge \beta _k\) if \(y\in F_k\), where \(\gamma \) is the linear dissipation rate as in Assumption 2.1. For \(k=1,\ldots , n\), denote

$$\begin{aligned} b_k=\inf _{x\in \mathcal {R}^d}\inf _{y\in F_k}\sum _{y'\in F_{k+1}}\lambda (x,y,y'),\quad d_k=\sup _{x\in \mathcal {R}^d}\sup _{y\in F_{k}}\sum _{y'\in F_{k-1}}\lambda (x,y,y'),\quad \nu _k=\prod _{i=1}^{k-1} \frac{b_i}{d_{i+1}}. \end{aligned}$$

Then the transition dissipation pair \((\Lambda (x),\gamma )\) has a monomial Lyapunov function if \(\sum _{k=1}^n\beta _k\nu _k>0\).

Proof

Let \(F'=\{1,\ldots ,n\}\), and let \(Y'_t\) be a birth–death process on \(F'\) with birth rates \(b_k\) and death rates \(d_k\). Using the detailed balance relation, it is easy to see \(\nu _k\) is a multiple of the invariant measure of \(Y'_t\). So if \(X'_t\) follows (1.2) with \(\gamma (Y'_t)=\beta _{Y'_t}\), Theorem 2.2 applies, and there is a monomial Lyapunov function \(a(y)|x|^m\). Meanwhile, let \(Y''_t\) be another birth–death process on \(F'\) with the same rates. We couple \(Y'_t\) and \(Y''_t\) so that they jump independently until they first take the same value, after which they make the same jumps simultaneously. Then if \(Y'_0\ge Y''_0\), we have \(\beta _{Y_t'}\ge \beta _{Y_t''}\) for all \(t>0\). Since the ratio \(a(k)/a(k')\) is given by (2.10), it is clear that \(a(k)\ge a(k')\) if \(k\le k'\). Then we apply Theorem 2.9 with \(P: y\rightarrow k\) if \(y\in F_k\) to see the original transition dissipation pair has a Lyapunov function. Condition (2.16) holds because if we let

$$\begin{aligned} b(q,k)=\left( \sum _{P(q')=k}\lambda (x,q,q') -\lambda '(x,P(q),k)\right) (a_i(k)-a_i(P(q))), \end{aligned}$$

then \(b(q,k)=0\) if \(q\in F_j, k\notin \{j-1,j+1\}\). And if \(q\in F_j\)

$$\begin{aligned} b(q,j-1)= \left( \sum _{P(q')=j-1}\lambda (x,q,q')-d_j\right) (a_{j-1}-a_{j})\le 0 \end{aligned}$$

and likewise \(b(q,j+1)\le 0\). \(\square \)
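To illustrate the criterion numerically, here is a small sketch with hypothetical bounds \(b_k, d_k\) and dissipation rates \(\beta _k\) (the three-block partition and all numbers are invented for illustration). It computes \(\nu _k\) through the detailed balance relation \(\nu _{k+1}d_{k+1}=\nu _k b_k\) and checks \(\sum _k\beta _k\nu _k>0\).

```python
import numpy as np

# Hypothetical three-block skeleton: b_k = minimal rate F_k -> F_{k+1},
# d_k = maximal rate F_k -> F_{k-1}, beta_k = dissipation lower bound on F_k.
b = np.array([2.0, 1.0])           # b_1, b_2
d = np.array([1.0, 2.0])           # d_2, d_3
beta = np.array([-1.0, 0.5, 2.0])  # increasing dissipation rates

# Detailed balance for the comparison birth-death chain:
# nu(k+1) d_{k+1} = nu(k) b_k, so nu is proportional to its invariant measure.
nu = np.ones(3)
for k in range(2):
    nu[k + 1] = nu[k] * b[k] / d[k]
print(nu)         # [1., 2., 1.]
print(beta @ nu)  # -1 + 1 + 2 = 2 > 0: a monomial Lyapunov function exists
```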

3 Geometric ergodicity

In Sect. 2, Lyapunov functions are constructed for diffusions with random switching when the dynamics is dissipative on average. Because these Lyapunov functions are in the form of polynomials, \(\mathbb {E}|X_t|^m\) is bounded uniformly for a proper m. Then by the Krylov–Bogoliubov theorem [9], there is at least one invariant measure for the joint process \(Z_t=(X_t,Y_t)\). It is natural to ask whether this invariant measure is unique, and how the law of \(Z_t\) converges to the unique invariant measure \(\pi \).

For many stochastic processes, this question is answered by geometric ergodicity. Namely, let d be a distance for probability measures, and \(P^*_t \mu \) be the law of \(Z_t\) given that \(Z_0\sim \mu \); then there are \(\gamma >0\) and \(C_{\mu ,\nu }\) such that

$$\begin{aligned} \hbox {d}(P^*_t\mu , P^*_t\nu )\le e^{-\gamma t} C_{\mu ,\nu }. \end{aligned}$$

By letting \(\mu \) be an invariant measure, the bound above indicates there is only one invariant measure and all other statistical states are attracted to it geometrically fast.

In a series of important works on this subject [17, 19, 40, 41], a general framework has been developed to verify geometric ergodicity, assuming that a Lyapunov function exists. In the following we will apply this framework and show \(Z_t\) is geometrically ergodic

  1. 1.

    in total variation distance if there is a commonly reachable minorization regime;

  2. 2.

    in a proper Wasserstein distance if there is contraction on average.

The Lyapunov function V will play a regularizing role in our discussion. In order to show geometric ergodicity for unbounded rates, we will replace the globally bounded or Lipschitz conditions in [7, 46] with weaker requirements that the transition rates or their derivatives are bounded by the Lyapunov function V. One important consequence is that an explosion, that is, \(|X_t|\) reaching infinity or \(Y_t\) making infinitely many jumps in finite time, is no longer possible. To see this, let \(T_t\) be the number of jumps of \(Y_s\) for \(s\le t\); then a space–time version of (2.4) shows that \(\mathcal {L}T_t=\bar{\lambda }(z)\). If \(\bar{\lambda }(z)\le MV(z)\) for a constant M, then applying Dynkin's formula to the stopping time \(\tau _N\), which is the first time either \(|X_t|=N\) or \(T_t=N\),

$$\begin{aligned} \mathbb {E}\left[ T_{t\wedge \tau _N}+V(Z_{t\wedge \tau _N})\right]&\le V(Z_0)+\mathbb {E}\int ^{t\wedge \tau _N}_0 (\bar{\lambda }(Z_s)-\gamma V(Z_s)+K){\text {d}}s\\&\le V(Z_0)+\int ^t_0 \mathbb {E}(MV(Z_s)+K){\text {d}}s\le V(Z_0)+t M(\mathbb {E}V(Z_0)+K/\gamma )+Kt. \end{aligned}$$

By letting \(N\rightarrow \infty \), we find \(T_t<\infty , V(Z_t)<\infty \) a.s. Since V has compact sublevel sets, \(|X_t|< \infty \) a.s. We have not mentioned this technical issue until now for two reasons: (1) if it were mentioned at the beginning of Sect. 2, it would be unclear how to find the Lyapunov function V; (2) both \(\mathcal {L}\) and (2.2) are well defined without this condition, so Sect. 2 could safely proceed without it.

3.1 Convergence in total variation with a minorization regime

The classical notion of ergodicity is often illustrated in the total variation norm, which is defined as

$$\begin{aligned} \Vert \mu -\nu \Vert _{tv}=\sup _{|f|\le 1}\int f \hbox {d}\mu -\int f\hbox {d}\nu . \end{aligned}$$
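When \(\mu \) and \(\nu \) have densities p and q, the supremum above is attained at \(f=\mathrm{sign}(p-q)\), which identifies \(\Vert \mu -\nu \Vert _{tv}\) with \(\int |p-q|\hbox {d}x\). A quick numerical sanity check on a grid, using two Gaussian densities:

```python
import numpy as np

# Two Gaussian densities p, q on a fine grid; the supremum of mu(f) - nu(f)
# over |f| <= 1 is attained at f = sign(p - q), recovering the L^1 distance.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-(x - 1.0)**2 / 2) / np.sqrt(2 * np.pi)

l1 = np.sum(np.abs(p - q)) * dx          # integral |p - q| dx
f = np.sign(p - q)                       # optimal test function
sup_form = np.sum(f * (p - q)) * dx      # integral f (p - q) dx
print(abs(l1 - sup_form) < 1e-12)        # True
```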

This norm is also called the \(L^1\) distance, because if \(\mu \) and \(\nu \) have densities p and q, then \(\Vert \mu -\nu \Vert _{tv}=\int |p-q|\hbox {d}x\). Geometric ergodicity in total variation is well studied and understood for finite-dimensional Markov chains and SDEs [40, 41]. A classical formulation can be found in Theorem 1.5 of [19] or Theorem 2.10 of [7], and here we present it using our notation:

Theorem 3.1

Let \(P_t\) be a Markov semigroup admitting a Lyapunov function V. Suppose the minorization condition holds, in the sense that for a sufficiently large C there are \(\epsilon , t_0>0 \) such that

$$\begin{aligned} \Vert P^*_{t_0}\delta _z -P^*_{t_0}\delta _{z'}\Vert _{tv}\le 2-\epsilon ,\quad \forall z,z': V(z), V(z')\le C. \end{aligned}$$

Then \((P_t)_{t\ge 0}\) has a unique invariant measure \(\pi \), and for some positive constants D and \(\beta \),

$$\begin{aligned} \Vert P_t^*\delta _z-\pi \Vert _{tv}\le D e^{-\beta t}(1+V(z)). \end{aligned}$$

When F consists of only one regime, \(X_t\) is simply the solution of an SDE on \(\mathcal {R}^d\). In this context, following the arguments in [40], the minorization condition of Theorem 3.1 can be verified by the hypoellipticity and reachability conditions below:

Assumption 3.2

Let \(\hbox {d}X_t=f(X_t)\hbox {d}t+\sigma (X_t)\circ \hbox {d}W_t\) be a diffusion process in \(\mathcal {R}^d\), where \(\circ \) denotes the Stratonovich integral,

  1. 1.

    Hypoellipticity condition: let \(\mathcal {L}\) be the Lie algebra generated by the vector fields

    $$\begin{aligned} \{f,\sigma _1,\ldots , \sigma _m\} \end{aligned}$$

    with \(\sigma _i\) being the columns of \(\sigma \), and \(\mathcal {L}_0\) is the ideal in \(\mathcal {L}\) generated by \(\{\sigma _1,\ldots , \sigma _m\}\), assuming \(\mathcal {L}_0\) spans \(\mathcal {R}^d\) at all points.

  2. 2.

    Reachability condition: there is a point \(x_h\in \mathcal {R}^d\) such that for any compact set C and \(\epsilon >0\), there is a \(t_0\) such that from any \(x\in C\) there is a piecewise constant process \(w_t\) such that the solution to the following ODE

    $$\begin{aligned} \hbox {d}x_t=[f(x_t)+\sigma (x_t)w_t] \hbox {d}t,\quad x_0=x, \end{aligned}$$

    satisfies \(|x_{t_0}-x_h|\le \epsilon \).

Theorem 3.4 below indicates that for diffusions with random switching, it suffices to check the minorization condition for one particular regime, using, say, Assumption 3.2, and to show this regime is commonly accessible and satisfies a mild growth condition for \(V(Z_t)\). In particular, we define

Definition 3.3

A regime \(y^*\in F\) is commonly accessible, if for all \(z\in \mathcal {R}^d\times F\) there is a \(t>0\) such that \(\mathbb {P}^z(Y_t=y^*)>0\). We say it has polynomial growth for a function V, if there is a constant \(K_t\) with polynomial growth in t, such that the following holds for the SDE \(\hbox {d}X'_t=b(X'_t,y^*)\hbox {d}t+\sigma (X'_t,y^*)\hbox {d}W_t\),

$$\begin{aligned} \mathbb {E}^{x} V(X'_t,y^*)\le K_t (V(x,y^*)+1). \end{aligned}$$
(3.1)

Theorem 3.4

Let \(Z_t=(X_t,Y_t)\) be a diffusion with random switching that admits a Lyapunov function V. Suppose the transition rates satisfy \(\bar{\lambda }(z)\le M V(z)\); moreover, suppose there is a regime \(y_h\in F\) that is commonly accessible and has polynomial growth for V. Then if the SDE given by \(\hbox {d}X'_t=b(X'_t,y_h)\hbox {d}t+\sigma (X'_t,y_h)\hbox {d}W_t\) satisfies the minorization condition in Theorem 3.1, the diffusion \(Z_t\) has an invariant measure \(\pi \) and is geometrically ergodic under the total variation distance.

The proof is located in Sect. 4, where we will also show a simple way to verify the common accessibility of one regime.

3.2 Wasserstein metric convergence with contraction on average

Another mechanism that may generate geometric ergodicity is contraction. Contraction can be formulated through the Lyapunov exponents of stochastic flows. Recall that in the diffusion part, b is required to be \(\mathcal {C}^{1+\delta }\) and \(\sigma \) is required to be \(\mathcal {C}^{2+\delta }\) in x, and there is no explosion. As a consequence, the solution to the SDE with \(Y_t=y\) can be written as \(X_t=\Psi ^{y,\omega }_t X_0\), where \(\Psi ^{y,\omega }_t\) is a diffeomorphism [27]. We say \(\rho :F\mapsto \mathcal {R}\) is a contraction rate function, if

$$\begin{aligned} \Vert D_x \Psi _t^{y,\omega } x\Vert \le \exp (-\rho (y)t),\quad \forall (x,y)\in \mathcal {R}^d\times F,\quad \mathrm{a.s.} \end{aligned}$$
(3.2)

Here \(\omega \) can be seen as a realization of the Wiener process in (1.1), and we denote its law as \(P_W\). One way to verify (3.2) is to impose the following requirement on (1.1):

$$\begin{aligned} (x-x')\cdot (b(x,y)-b(x',y))\le -\rho (y)|x-x'|^2,\quad \sigma (x, y)=\sigma (y). \end{aligned}$$
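For instance, for a drift that is linear in x, \(b(x,y)=-\rho (y)x\), the one-sided condition above holds with equality. A small Monte Carlo spot check (the regime labels and rates \(\rho \) are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear drift b(x, y) = -rho(y) * x: the one-sided condition
# (x - x') . (b(x, y) - b(x', y)) <= -rho(y)|x - x'|^2 holds with equality.
rho = {1: 1.5, -1: -0.5}   # regime -1 is expanding (negative rate)

def b(x, y):
    return -rho[y] * x

ok = True
for _ in range(1000):
    x, xp = rng.normal(size=3), rng.normal(size=3)
    for y, r in rho.items():
        lhs = np.dot(x - xp, b(x, y) - b(xp, y))
        ok = ok and lhs <= -r * np.dot(x - xp, x - xp) + 1e-9
print(ok)  # True
```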

[7, 46] and the references therein also suggest other methods to establish (3.2), possibly under a different norm, and our method below can likely be generalized to those scenarios as well. We will say the joint process admits a contraction on average if there are constants \(m,\bar{\rho },C_\rho >0\) such that

$$\begin{aligned} \mathbb {E}^z \exp \bigg (-m\int ^t_0 \rho (Y_s){\text {d}}s\bigg )\le C_\rho \exp (-\bar{\rho } t). \end{aligned}$$
(3.3)

The next subsection will discuss how to verify (3.3) given the contraction rates \(\rho \).

The total variation norm is often too stringent to capture contraction. For example, consider a trivial deterministic process in \(\mathcal {R}\), \(\hbox {d}X_t=-\rho X_t \hbox {d}t\). The invariant measure is obviously \(\delta _0\), a point mass at the origin, and it attracts other points. Yet, starting from any nonzero point, the distribution of \(X_t\) is a point mass at \(e^{-\rho t}X_0\), which has total variation distance 2 from \(\delta _0\).

A more suitable distance for our purpose is the Wasserstein distance, which is also used in previous works for models with bounded or Lipschitz transition rates [3, 4, 7, 36]. For any distance d on a space E, the associated Wasserstein distance between two measures \(\mu \), \(\nu \) on E is defined as:

$$\begin{aligned} \hbox {d}(\mu ,\nu ):=\inf _{\Gamma \in \mathcal {C}(\mu ,\nu )} \int \hbox {d}(x,x') \Gamma (\hbox {d}x,\hbox {d}x') \end{aligned}$$
(3.4)

Here \(\mathcal {C}(\mu ,\nu )\) is the set of all coupling measures between \(\mu \) and \(\nu \).

The distance function here can be very flexible. One remarkable discovery in [17, 19] is that by properly incorporating the Lyapunov function into d, the corresponding Wasserstein distance can characterize relatively weak convergence. This is known as the asymptotic coupling framework. As for diffusions with random switching, this framework allows us to generalize geometric ergodicity results to cases where the transition rates and their derivatives are bounded only by the Lyapunov function.

Similar to the situation with dissipation, the notion of contraction on average is essential for our discussion. The precise statement is the following:

Theorem 3.5

Let \(Z_t=(X_t,Y_t)\) be a diffusion with random switching that admits a Lyapunov function V. Suppose the following four conditions hold

  1. 1.

    V(x, y) has polynomial growth in x, so there are n, M such that

    $$\begin{aligned} V(x+u, y)\le M(V(x,y)+|u|^{n})\quad \text {and}\quad V(x,y)\ge \frac{1}{M}|x|^{\frac{1}{n}}. \end{aligned}$$
  2. 2.

    The transition rates and their derivatives are bounded by MV with a constant \(M>0\)

    $$\begin{aligned} \bar{\lambda }(z)\le MV(z),\quad \sum _{y'}|\nabla _x \lambda (x,y,y')|\le MV(z). \end{aligned}$$
  3. 3.

    Each regime admits a contraction rate \(\rho \) in the sense of (3.2), and the averaged dynamics is contractive as there are \(C_\rho ,m,\bar{\rho }>0\):

    $$\begin{aligned} \mathbb {E}^z \exp \bigg (-m\int ^t_0 \rho (Y_s){\text {d}}s\bigg )\le C_\rho \exp (-\bar{\rho } t). \end{aligned}$$
  4. 4.

    There is a commonly accessible regime \(y_c\), such that \(\rho (y_c)>0\) and V has at most polynomial growth in the sense of Definition 3.3.

Then \(Z_t\) has a unique invariant measure \(\pi \); moreover, the distribution of \(Z_t\) converges to \(\pi \) geometrically fast in the Wasserstein distance generated by \(m(z,z')=1_{y=y'}\big (1\wedge |x-x'|\big )+1_{y\ne y'}\). In particular, the following bound holds with some \(C,\beta >0\),

$$\begin{aligned} m(P^*_t\delta _z, \pi )\le C e^{-\beta t}(1+V(z)). \end{aligned}$$

3.3 Contraction on average

Given the contraction rates \(\rho \) in each regime, the contraction on average condition (3.3) can be verified using arguments similar to those for dissipation on average in Sect. 2. Yet, there is an important difference. When we construct a Lyapunov function, it suffices to consider the transitions for large |x|, and it suffices for upper bounds to hold modulo a constant, see Lemma 6.2. This is no longer the case for contraction on average, and we will need the “\(\lesssim \)” inequalities in Sect. 2 to hold with “\(\le \).” As a consequence, the spectrum and comparison arguments still work with minor variants, but the scaling argument no longer works.

The following theorem is the contraction version of Theorem 2.2, where the transition rates are allowed to be nonconstant.

Theorem 3.6

If there are constants \(m,\bar{\rho }>0\) and a vector a with strictly positive components such that the following holds

$$\begin{aligned} -m\rho (y)a(y)+[\Lambda (x)a]_y\le -\bar{\rho }a(y) \end{aligned}$$
(3.5)

for all \((x,y)\in E\), then the contraction on average (3.3) also holds. In particular, if \(\Lambda (x)\) is a constant irreducible transition matrix, and over its invariant measure \(\pi \), \(\sum \pi (y)\rho (y)>0\), then the conditions mentioned above hold.

Proof

Consider an auxiliary scalar process \(\hbox {d}U_t=-m \rho (Y_t)U_t\hbox {d}t\) with \(U_0=1\); then clearly

$$\begin{aligned} U_t=\exp \left( -m \int ^t_0 \rho (Y_s){\text {d}}s\right) >0. \end{aligned}$$

Now consider a joint function \(V(z,u)=a(y)u\), which satisfies the following for any \(u>0\),

$$\begin{aligned} \mathcal {L}V(z,u)=-m\rho (y)a(y)u+[\Lambda (x)a]_yu\le -\bar{\rho }a(y)u=-\bar{\rho }V(z,u). \end{aligned}$$

As a consequence, \(\mathbb {E}U_t\le e^{-\bar{\rho }t}\frac{\max _y a(y)}{\min _y a(y)}\). \(\square \)
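For a constant generator, condition (3.5) can be checked numerically through the leading (Perron–Frobenius) eigenvalue of \(\Lambda -m\,\mathrm{diag}(\rho )\): if it is negative, the corresponding positive eigenvector serves as a. A sketch with a hypothetical two-regime example:

```python
import numpy as np

# Two regimes with contraction rates rho = (2, -1) and a constant irreducible
# generator Lambda; pi = (1/2, 1/2), so the average contraction is 0.5 > 0.
Lam = np.array([[-1.0, 1.0],
                [1.0, -1.0]])
rho = np.array([2.0, -1.0])
m = 0.3

# Condition (3.5) asks for a > 0 with (Lambda - m diag(rho)) a <= -rho_bar a.
# For this Metzler matrix it suffices that the leading eigenvalue is negative;
# the corresponding (Perron) eigenvector is the candidate a.
M = Lam - m * np.diag(rho)
w, v = np.linalg.eig(M)
i = np.argmax(np.real(w))
a = np.real(v[:, i])
a = a * np.sign(a[0])         # fix the overall sign
print(np.real(w[i]))          # about -0.053: negative, so (3.5) holds
print(np.all(a > 0))          # True: strictly positive components
```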

Following the representation (2.10), we can see a(y) in (3.5) as the potential contraction of one regime. Therefore, the comparison principle can be formulated as follows:

Proposition 3.7

Let \((\rho , \Lambda (x))\) satisfy (3.5) with a strictly positive vector a. Then \((\tilde{\rho }, \widetilde{\Lambda }(x) )\) satisfies the same inequality if it is more favorable for contraction in the sense that:

$$\begin{aligned} \tilde{\rho }(y)\ge \rho (y),\quad (\lambda (x,y,y')-\tilde{\lambda }(x,y,y'))(a(y)-a(y'))\ge 0. \end{aligned}$$

The proof is a direct verification and is omitted here. On the other hand, contraction on average requires the contraction to hold uniformly over \(\mathcal {R}^d\), not just for large enough |x|. This is probably an intrinsic requirement, as the following example suggests:

Example 3.8

Let \(F=\{1,-1\}\), and let \(\hbox {d}X_t=-Y_t X_t \hbox {d}t\) be a scalar process, while

$$\begin{aligned} \lambda (x, 1,-1)=x^2+3,\quad \lambda (x,-1,1)=3x^2+1. \end{aligned}$$

With a simple application of Theorem 2.4, it is easy to see this system is dissipative on average. Indeed, the transition rates favor contraction when |x| is large. However, when |x| is close to 0, the transitions favor expansion. As a consequence, although \(\frac{1}{4}\delta _{(0, 1)}+\frac{3}{4}\delta _{(0,-1)}\) is clearly an invariant measure, starting from any \(x_0\ne 0\), \(x_t\) will never reach 0. In fact, there is at least one other invariant measure on \(\mathcal {R}^+\) with density:

$$\begin{aligned} p(x,1)=p(x,-1)=x\exp (-x^2),\quad x>0. \end{aligned}$$

The invariance can be verified by the Fokker–Planck equation, which is the dual of \(\mathcal {L}\). On the other hand, if we replace the ODE of \(X_t\) with \(\hbox {d}X_t=-Y_t X_t\hbox {d}t+\hbox {d}W_t\), the noise will connect the invariant measures, so the process becomes ergodic by Theorem 3.4.
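This Fokker–Planck verification can be carried out symbolically: in each regime y, the stationary equation reads \(0=-\partial _x(b(x,y)p(x,y))+\lambda (x,y',y)p(x,y')-\lambda (x,y,y')p(x,y)\). A sketch using sympy:

```python
import sympy as sp

x = sp.symbols('x', positive=True)
p = x * sp.exp(-x**2)        # claimed stationary density p(x, 1) = p(x, -1)

# Drifts: dX_t = -Y_t X_t dt gives b(x, 1) = -x and b(x, -1) = x.
lam_1_to_m1 = x**2 + 3       # lambda(x, 1, -1)
lam_m1_to_1 = 3 * x**2 + 1   # lambda(x, -1, 1)

# Stationary Fokker-Planck equation in each regime:
#   0 = -d/dx( b(x, y) p(x, y) ) + inflow - outflow
eq1 = -sp.diff(-x * p, x) + lam_m1_to_1 * p - lam_1_to_m1 * p  # regime y = 1
eq2 = -sp.diff(x * p, x) + lam_1_to_m1 * p - lam_m1_to_1 * p   # regime y = -1
print(sp.simplify(eq1), sp.simplify(eq2))  # 0 0
```

Both residuals vanish identically, confirming the stated invariance.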

4 Geometric ergodicity through random PDMPs

4.1 Random PDMPs

Piecewise deterministic Markov processes (PDMPs) are special cases of diffusions with random switching, as they require the SDE of \(X_t\) to be an ODE, in other words, \(\sigma (Z_t)\equiv 0\). In this case, assuming \(Y_r\) stays in regime y from time s up to time t, the value of \(X_t\) is given by a diffeomorphism \(X_t=\Psi ^y_{s,t}X_s\) for any \(s\le t\). A PDMP can be defined based on \(\Psi ^y_{s,t}\) and transition rates \(\lambda (x,y,y')\). In fact, \(X_t\) is completely governed by the jumps of \(Y_t\): if \(Y_t\) has jumps at times \(\mathbf {t}=(t_1,\ldots , t_n)\) with a jump sequence \(\mathbf {y}=(y_1,\ldots ,y_n)\), then \(X_t\) is given by

$$\begin{aligned} X_t=\Psi (X_0, \mathbf {t}, \mathbf {y}, t):=\Psi ^{y_n}_{t_n, t}\circ \Psi ^{y_{n-1}}_{t_{n-1},t_n}\circ \cdots \circ \Psi ^{y_0}_{0,t_1}X_0. \end{aligned}$$
(4.1)

Moreover, the probability density of such an event, that is, having n jumps before time t with jump times \(t_k\) and jumps going to \(y_k\), is given by formula 3.10 of [20]

$$\begin{aligned} p^{z_0,t}_{n,\mathbf {t},\mathbf {y}}\hbox {d}\mathbf {t}:= \mathbbm {1}_{t_1<t_2<\cdots <t_n}\exp \left( -\int ^t_0 \bar{\lambda }(z_s){\text {d}}s\right) \prod _{i=1}^n(\lambda (z_{t_i-},y_{i})\hbox {d}t_{i}). \end{aligned}$$
(4.2)

In (4.2), the X part of \(z_s\) is given by (4.1), and \(z_{s-}=\lim _{r\uparrow s} z_r\) denotes the left limit.
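The exponential factor in (4.2) is exactly a no-jump survival probability, which suggests a direct way to sample the first jump time of a PDMP: accumulate \(\int _0^t \bar{\lambda }(z_s)\hbox {d}s\) along the deterministic flow until it exceeds an Exp(1) draw. A sketch, with a hypothetical flow and total rate chosen purely for illustration:

```python
import math, random

random.seed(1)

def first_jump_time(x0, lam_bar, flow, dt=1e-3, t_max=100.0):
    """Sample the first jump time of a PDMP started at x0: by the density
    formula, the no-jump probability up to t is exp(-int_0^t lam_bar(x_s) ds)
    along the deterministic flow, so we accumulate that integral until it
    exceeds an Exp(1) draw."""
    target = -math.log(1.0 - random.random())  # Exp(1) random variable
    acc, t, x = 0.0, 0.0, x0
    while acc < target and t < t_max:
        acc += lam_bar(x) * dt
        x = flow(x, dt)  # advance the within-regime ODE by one step
        t += dt
    return t

# Hypothetical ingredients for illustration: the flow of dx/dt = -x and
# a state-dependent total rate lam_bar(x) = 1 + x^2.
flow = lambda x, dt: x * math.exp(-dt)
lam_bar = lambda x: 1.0 + x * x

samples = [first_jump_time(2.0, lam_bar, flow) for _ in range(1000)]
print("mean first jump time:", sum(samples) / len(samples))
```

Iterating this step and re-drawing the regime at each jump generates the full jump sequence \((\mathbf {t},\mathbf {y})\) with the density (4.2).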

On the other hand, diffusions with random switching can be viewed as random PDMPs. As noted in Sect. 3.2, the solution of the SDE in a fixed regime can be written as \(X_t=\Psi ^{y,\omega }_{s,t} X_s\), where \(\omega \) denotes the realization of the Wiener process \(W_s\); we denote the law of \(\omega \) as \(P_W\). Therefore, conditioned on each realization \(\omega \), a diffusion with random switching is simply a PDMP. Following the nomenclature of statistical physics [47], this PDMP will be called a quenched process, as it is the original joint process \(Z_t\) conditioned on one realization of the random outcome \(\omega \). In contrast, the original process without conditioning will be called the annealed process. We will adopt this simple terminology throughout.

Viewing a diffusion with random switching as a random PDMP gives us two explicit formulas. First, given the jump times \(\mathbf {t}\) and jumps \(\mathbf {y}\), \(X_t\) is given by \(\Psi _\omega (X_0, \mathbf {t}, \mathbf {y}, t)\), which is defined in (4.1) with \(\Psi ^y_{s,t}\) replaced by \(\Psi ^{y,\omega }_{s,t}\). Second, if \(p^{z_0,t,\omega }_{n,\mathbf {t}, \mathbf {y}}\) denotes the density (4.2) with the diffeomorphisms being \(\Psi ^{y,\omega }_{s,t}\), then the law of the annealed process can be recovered by averaging over \(P_W\):

$$\begin{aligned} \mathbb {P}^{z_0}(Z_{s\le t}\in A)=\int P_W(\text {d}\omega )\sum _{n=0}^\infty \sum _{\mathbf {y}\in F^n} \int _{[0,t]^n} \hbox {d}\mathbf {t}p^{z_0,t,\omega }_{n,\mathbf {t}, \mathbf {y}}\mathbbm {1}_{z_{s\le t}\in A}, \end{aligned}$$
(4.3)

where \(z_s\) has its x part given by \(x_s=\Psi _\omega (x_0, \mathbf {t}, \mathbf {y}, s)\) and \(y_s=y_k\) if \(t_k\le s< t_{k+1}\). These explicit formulas will be instrumental for the proofs below.

4.2 Accessibility analysis

In both the minorization and contraction-on-average scenarios, we need a good regime to be commonly accessible. In this section we discuss the consequences of this assumption and also provide a simple way to verify it in Lemma 4.2. Most derivations here are relatively standard and have simpler versions in [4, 5, 7] when the transition rates are bounded. We provide complete proofs to be self-contained.

Lemma 4.1

Suppose \(Z_t\) admits a Lyapunov function V, \(\Lambda (x)\) is continuous in x, and \(\bar{\lambda }\le MV\) for some \(M>0\). Then:

  1. 1.

    If \(\mathbb {P}^z(Y_t=\tilde{y})>0\), then \(\mathbb {P}^z(Y_{t+s}=\tilde{y})>0\) for any \(s\ge 0\);

  2. 2.

For each z and \(t>0\), there exists a neighborhood \(O_x\subset \mathbb {R}^d\) of x such that

    $$\begin{aligned} \mathbb {P}^{x',y}(Y_t=\tilde{y})\ge \frac{1}{2}\mathbb {P}^z(Y_t=\tilde{y}),\quad \forall x'\in O_x; \end{aligned}$$
  3. 3.

If there is a \(\tilde{y}\in F\) that is commonly accessible, then for any fixed compact set C, there exist some \(t_0,m_0>0\) such that

    $$\begin{aligned} \mathbb {P}^z(Y_{t_0}=\tilde{y})\ge m_0,\quad \forall z\in C. \end{aligned}$$

Proof

Claim 1 Our condition implies \(p^{z,t,\omega }_{n,\mathbf {t},\mathbf {y}}>0\) with \(y_n=\tilde{y}\) for certain \(\omega , \mathbf {t}, \mathbf {y}\). Then observe that,

$$\begin{aligned} p^{z,t+s,\omega }_{n,\mathbf {t},\mathbf {y}}=p^{z,t,\omega }_{n,\mathbf {t},\mathbf {y}}\exp \bigg (-\int ^s_0 \bar{\lambda }(\Psi ^{y_n,\omega }_{t,t+r}x_t, y_n)\text {d}r\bigg ). \end{aligned}$$

The exponential term above is nonzero for \(P_W\)-a.s. \(\omega \), because (2.3) leads to

$$\begin{aligned} \int ^s_0 \int P_W(\text {d}\omega )\bar{\lambda }(\Psi ^{y_n,\omega }_{t,t+r}x_t, y_n)\text {d}r&\le M \int \int ^s_0 P_W(\text {d}\omega ) V(\Psi ^{y_n,\omega }_{t,t+r}x_t, y_n)\text {d}r\\&\le sK_sM (V(x_t,y_n)+1)<\infty . \end{aligned}$$

So for \(P_W\)-a.s. \(\omega \), \(p^{z,t,\omega }_{n,\mathbf {t},\mathbf {y}}>0\) implies \(p^{z,t+s,\omega }_{n,\mathbf {t},\mathbf {y}}>0\), and annealing (4.3) produces our claim.

Claim 2 By formula (4.2), \(p^{z_0,\omega , t}_{n,\mathbf {t},\mathbf {y}}\) depends continuously on \(x_{s\le t}\), which in turn depends continuously on \(x_0\) because of (4.1). Thus \(p^{z_0,\omega , t}_{n,\mathbf {t},\mathbf {y}}\) depends continuously on \(x_0\), and Claim 2 follows by applying Fatou’s lemma to the following annealing formula along any sequence \(z'\rightarrow z\):

$$\begin{aligned} \mathbb {P}^{z'}(Y_t=\tilde{y})=\int P_W(\text {d}\omega ) \hbox {d}\mathbf {t}\sum _{n,\mathbf {y}:y_n=\tilde{y}}p^{z',t,\omega }_{n,\mathbf {t},\mathbf {y}}. \end{aligned}$$

Claim 3 By Claim 2 and the compactness of C, we can find a finite cover \(\{O_i\}_{i=1,\ldots , n}\) of C and a sequence of times \(t_{i}\) such that

$$\begin{aligned} \mathbb {P}^{z'}(Y_{t_i}=\tilde{y})>0\quad \forall z'\in O_i. \end{aligned}$$

Letting \(t_0=\max \{t_i\}\), we have \(\mathbb {P}^z(Y_{t_0}=\tilde{y})>0\) for all \(z\in C\) by Claim 1. Then, using the compactness again with Claim 2, we can find a uniform lower bound \(m_0\) for the transition probability. \(\square \)

The following lemma provides an easy way to verify that a regime \(y^*\) is commonly accessible.

Lemma 4.2

(Burst mechanism) Under the same conditions as Lemma 4.1, fix any \(z_0\in E\) and a sequence \(y_0,y_1,\ldots , y_n\) in F such that

$$\begin{aligned} \lambda (x_0,y_i,y_{i+1})>0,\quad i=0,1,\ldots ,n-1. \end{aligned}$$
(4.4)

Then for any \(t>0\), \(\mathbb {P}^{z_0}(Y_t=y_n)>0.\) Therefore, if there is a \(y^*\in F\) such that for any \(z_0\in E\) there is a sequence \(y_0,\ldots ,y_n=y^*\) such that (4.4) holds, then \(y^*\) is commonly accessible.

Proof

By Claim 1 of Lemma 4.1, it suffices to show our claim for sufficiently small t. Since \(\lambda \) is continuous in x, we can find \(0<\delta <1\) and an \(M>0\) such that the following holds:

$$\begin{aligned} \lambda (x,y_i,y_{i+1})>0, \quad \bar{\lambda }(x,y_i)<M,\quad \forall \Vert x-x_0\Vert \le \delta , i=0,1,\ldots ,n-1. \end{aligned}$$

Then for \(P_W\)-a.s. realization of \(\omega \), because \(\Psi ^{y,\omega }_{s,t} x\) is continuous in s, t and x, by induction there is a sequence of measurable functions \(\epsilon _k(\omega )\le \delta \), \(k=0,1,\ldots ,n\), such that the following holds:

$$\begin{aligned} |\Psi ^{y_{k-1},\omega }_{s,t} x-x_0|<\epsilon _k(\omega ),\quad \forall x: |x-x_0|<\epsilon _{k-1}(\omega ),\,s\le t\le \epsilon _k(\omega ), \end{aligned}$$

where \(\epsilon _n(\omega )=\delta \). Then pick any \(\epsilon \) such that \(P_W(\epsilon _0(\omega )>\epsilon )>0\), and consider any fixed jump time sequence \(\mathbf {t}=(t_1,t_2,\ldots , t_n)\) with \(t_n<\epsilon \), together with the generated process \(x_s=\Psi _\omega (x_0,\mathbf {t},\mathbf {y},s)\), where \(\mathbf {y}=(y_1,\ldots ,y_n)\). It is easy to verify that if \(\epsilon _0(\omega )>\epsilon \), then \(|x_s-x_0|<\delta \) for all \(s\le \epsilon \); therefore, \(p^{z_0,\omega ,\epsilon }_{n,\mathbf {y},\mathbf {t}}>0\) for these \(\omega \). This completes the proof by the annealing formula (4.3). \(\square \)

4.3 Ergodicity with a minorization regime

Due to Theorem 3.1, the proof of Theorem 3.4 is a relatively standard verification of the small set condition for the full Markov semigroup \(P_t\).

Proof of Theorem 3.4

In order to apply Theorem 3.1, it suffices to show the minorization condition. By the equivalence between total variation distance and coupling measures, this amounts to building a coupling of \(Z_t\) and \(Z_t'\) such that \(P(Z_t=Z_t')\ge \epsilon \) whenever \(V(Z_0), V(Z'_0)\le C\). Our strategy is to first show that there are \(t_0>0\) and \(\delta >0\) such that

$$\begin{aligned} \mathbb {P}^z(Y_{t_0}=y_h, V(Z_{t_0})\le C)>\delta , \quad \forall z\in U, \end{aligned}$$
(4.5)

and then use the fact that \(Y_t\) can remain at \(y_h\) with positive probability, while the minorization in regime \(y_h\) can be used to build a coupling.

Since \(y_h\) is commonly accessible, by Lemma 4.1(3), there are \(t_0,m_0\) such that \(\mathbb {P}^z(Y_{t_0}=y_h)>m_0\) for all \(z\in \mathcal {R}^d\times F\). Because V is a Lyapunov function, \(\mathbb {E}^z(V(Z_{t_0}))\le e^{-\gamma t_0}V(z)+K/\gamma \). By picking a large C, we can make

$$\begin{aligned} \mathbb {P}^z(V(Z_{t_0})> C)\le \frac{\mathbb {E}^z V(Z_{t_0})}{C}\le \frac{1}{2}m_0, \end{aligned}$$

hence (4.5) holds with \(\delta =\frac{1}{2}m_0\). By Lemma 6.1, there is a coupling of \(Z_{t_0}\) and \(Z'_{t_0}\) with law \(\mathbb {Q}^z\) such that

$$\begin{aligned} \mathbb {Q}^z((Z_{t_0},Z_{t_0}')\in A_1)>\delta ,\quad A_1:=\{(z,z'): y=y'=y_h, V(z),V(z')\le C\}. \end{aligned}$$

By our assumption, \(y_h\) satisfies the minorization condition. This means there are \(t_1>0\) and \(\epsilon >0\) such that for any \(x,x'\) with \(V(x,y_h), V(x',y_h)\le C\), there is a coupling of \(P_W(\text {d}\omega )\) and \(P_W(\text {d}\omega ')\), denoted by \(Q_W(\text {d}\omega , \text {d}\omega ')\), such that

$$\begin{aligned} \mathbb {P}^{x,x'}_{y_h}(X_{t_1}=X'_{t_1})=\int Q_W (\text {d}\omega , \text {d}\omega ') \mathbbm {1}_{\Psi ^{y_h,\omega }_{0,t_1}x=\Psi ^{y_h,\omega '}_{0,t_1}x'}\ge \epsilon . \end{aligned}$$
(4.6)

Now we extend \(\mathbb {Q}^z\) from time \(t_0\) to \(T=t_0+t_1\) by coupling \(\omega ,\omega '\) according to \(Q_W\) after time \(t_0\). Then the Markov property yields

$$\begin{aligned} \mathbb {Q}^z(Z_{T}=Z'_T)&\ge \mathbb {Q}^z(Y_{t_0\le s\le T}=Y'_{t_0\le s\le T}=y_h, V(Z_{t_0}),V(Z'_{t_0})\le C, X_{T}=X'_T)\\&\ge \frac{1}{2}m_0 \inf _{(z,z')\in A_1} \int Q_W(\text {d}\omega ,\text {d}\omega ') p^{z,t_1,\omega }_{0,\emptyset ,\emptyset }p^{z',t_1,\omega '}_{0,\emptyset ,\emptyset }\mathbbm {1}_{\Psi ^{y_h,\omega }_{0,t_1}x= \Psi ^{y_h,\omega '}_{0,t_1}x'}. \end{aligned}$$

It suffices to show that the density of having no jumps until time \(t_1\), which is \(p^{z,t_1,\omega }_{0,\emptyset ,\emptyset }p^{z',t_1,\omega '}_{0,\emptyset ,\emptyset }\), is bounded from below on a set of probability more than \(1-\frac{1}{2}\epsilon \); a union bound with (4.6) will then yield our claim. For this purpose, note

$$\begin{aligned} p^{(x,y_h),t_1,\omega }_{0,\emptyset ,\emptyset }=\exp \left( -\int ^{t_1}_0 \bar{\lambda }(x_s,y_h) {\text {d}}s\right) , \end{aligned}$$

while, by the bound \(\bar{\lambda }\le MV\) and the moment bound on V within regime \(y_h\),

$$\begin{aligned} \int P_W(\text {d}\omega ) \int ^{t_1}_0 \bar{\lambda }(x_s,y_h) {\text {d}}s\le M\int P_W(\text {d}\omega ) \int ^{t_1}_0 V(x_s,y_h) {\text {d}}s \le Mt_1K_{t_1}(V(x_0)+1). \end{aligned}$$

So by Markov’s inequality, there is an N such that

$$\begin{aligned} P_W\left( p^{z,t_1,\omega }_{0,\emptyset ,\emptyset }\le \exp (-N)\right) \le P_W\left( \int ^{t_1}_0 \bar{\lambda }(x_s,y_h) {\text {d}}s\ge N\right) \le \frac{1}{4}\epsilon . \end{aligned}$$

Then, because \(Q_W\) is a coupling, a union bound gives

$$\begin{aligned} Q_W\left( p^{z,t_1,\omega }_{0,\emptyset ,\emptyset }p^{z',t_1,\omega '}_{0,\emptyset ,\emptyset }\le \exp (-2N)\right) \le \frac{1}{2}\epsilon . \end{aligned}$$

\(\square \)

4.4 Ergodicity with contraction on average

The proof of Theorem 3.5 uses the asymptotic coupling mechanism introduced in [18, 19]. Theorem 4.8 of [19], presented below, formulates our application of this mechanism.

Theorem 4.3

Let \(P_t\) be a Markov semigroup over a Polish space E admitting a continuous Lyapunov function V, so \(\mathbb {E}V(Z_t)\le e^{-\gamma t}\mathbb {E}V(Z_0)+K\). Suppose there exists a distance-like function \(d:E\times E\mapsto [0,1]\) and a time t such that

  1. 1.

    \(P_t\) is locally contracting in d:

    $$\begin{aligned} d(P_t^* \delta _z, P_t^*\delta _{z'})\le \frac{1}{2}d(z,z'),\quad \forall d(z,z')<1. \end{aligned}$$
  2. 2.

    Smallness: for any two \(z,z'\) such that \(V(z),V(z')\le K\), \(d(P_t^* \delta _z, P_t^*\delta _{z'})\le 1-\epsilon \).

Then \(P_t\) has at most one invariant probability measure \(\pi \). Furthermore, letting \(\tilde{d}(z, z') = \sqrt{d(z, z')(1 + V (z) + V (z'))}\), there exists a \(t > 0\) such that \(\tilde{d}(P^*_t\mu ,P^*_t \nu )\le \frac{1}{2}\tilde{d}(\mu ,\nu )\) for any probability measures \(\mu , \nu \) on E.

In [19], \(d: E\times E\mapsto \mathcal {R}^+\) is distance-like if it is symmetric, lower semicontinuous and \(d(z,z')=0 \Leftrightarrow z=z'\). Its associated Wasserstein-1 distance is also denoted by d for notational simplicity; in other words, for two probability measures \(\mu \) and \(\nu \),

$$\begin{aligned} \hbox {d}(\mu ,\nu ):=\inf _{\Gamma }\int \hbox {d}(z,z')\Gamma (\hbox {d}z,\hbox {d}z'), \end{aligned}$$

where the infimum is taken over all coupling measures of \(\mu \) and \(\nu \).
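For measures on a finite space, this coupling infimum is a small linear program, which can be sketched as follows; the two-point example and the helper name `wasserstein` are purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein(mu, nu, cost):
    """Kantorovich LP: minimize sum_ij cost[i, j] * G[i, j] over all
    couplings G with row sums mu and column sums nu."""
    m, n = cost.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # row marginal: sum_j G[i, j] = mu[i]
    for j in range(n):
        A_eq[m + j, j::n] = 1.0           # column marginal: sum_i G[i, j] = nu[j]
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None))
    return res.fun

# Two measures on a two-point space with unit cost between distinct points.
mu = np.array([0.5, 0.5])
nu = np.array([1.0, 0.0])
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
print(wasserstein(mu, nu, cost))  # the optimal coupling moves mass 0.5
```

With the 0-1 cost \(d(z,z')=1_{z\ne z'}\), the same LP value recovers the total variation distance, which is how the two notions of distance connect in this section.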

The reason that (1) is called a local contraction is that, in most applications, \(\hbox {d}(z,z')=1\) unless z and \(z'\) are very close. Theorem 4.3 essentially extends a local contraction to a global one.

4.4.1 Contracting distance

For the construction of a contracting distance, we have the following lemma. It is a variant of Lemma 4.13 in [19], using a Lyapunov function instead of a super Lyapunov function. The proof is very similar.

Proposition 4.4

Under the conditions of Theorem 3.5, the following distance with any positive \(r\le \min \{\frac{m}{2}, \frac{1}{2}\}\) is locally \(\tfrac{1}{2}\)-contracting for \(P_T\) with a proper T and \(\delta <2^{-\frac{1}{r}}\):

$$\begin{aligned} \hbox {d}(z,z')=1_{y\ne y'}+ 1_{y=y'}\wedge \delta ^{-1}\bigg (\inf _{\theta :x\rightarrow x'}\int ^1_0 V(\theta (s),y)|\dot{\theta }(s)|{\text {d}}s\bigg )^r. \end{aligned}$$
(4.7)

Here the infimum is taken over all \(C^1\) paths \(\theta :[0,1]\mapsto \mathcal {R}^d\) that connect x and \(x'\).
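The path integral in (4.7) can be evaluated numerically for any given path; the sketch below uses the straight-line path (hence only an upper bound on the infimum) and a hypothetical Lyapunov-type weight \(V(x)=1+|x|^2\) chosen for illustration.

```python
import numpy as np

def path_distance(x, xp, V, n_steps=1000):
    """Evaluate int_0^1 V(theta(s)) |theta'(s)| ds along the straight
    path theta(s) = x + s (xp - x) by the midpoint rule; the distance in
    the text takes an infimum over all C^1 paths, so this is only an
    upper bound on that infimum."""
    s = (np.arange(n_steps) + 0.5) / n_steps
    theta = x[None, :] + s[:, None] * (xp - x)[None, :]
    speed = np.linalg.norm(xp - x)  # |theta'(s)| is constant on this path
    return np.mean(V(theta)) * speed

# Hypothetical Lyapunov-type weight V(x) = 1 + |x|^2, for illustration.
V = lambda pts: 1.0 + np.sum(pts**2, axis=1)
x, xp = np.array([1.0, 0.0]), np.array([1.1, 0.0])
d_path = path_distance(x, xp, V)
print(d_path)  # close to V(x) |x - xp| = 0.2 for nearby points
```

For nearby points the value stays within a constant factor of \(V(z)|x-x'|\), which is exactly the comparison established next.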

Before we move on to the proof of Proposition 4.4, we need two auxiliary lemmas. The first one indicates that \(d(z,z')\approx 1_{y\ne y'}+ 1_{y=y'}\wedge \delta ^{-1}(V(z)|x-x'|)^r.\)

Lemma 4.5

Assume condition (1) of Theorem 3.5, and fix any \(y\in F\) and \(x,x'\in \mathbb {R}^d\) such that

$$\begin{aligned} \inf _{\theta :x\rightarrow x'}\int ^1_0 V(\theta (s),y)|\dot{\theta }(s)|{\text {d}}s\le \frac{1}{2}, \end{aligned}$$

then

$$\begin{aligned} \frac{1}{2M} |x-x'|V(z)\le \inf _{\theta :x\rightarrow x'}\int ^1_0 V(\theta (s),y)|\dot{\theta }(s)|{\text {d}}s\le 2M |x-x'|V(z). \end{aligned}$$

Proof

Recall that we can always assume that \(V\ge 1\), so our condition leads to \(|x-x'|\le \frac{1}{2}\). Because of the polynomial growth of V in x, when \(|u|\le 1\) we have \(V(x+u,y)\ge M^{-1}V(x,y)-1\); combining this with \(V(x+u,y)\ge 1\) and averaging the two lower bounds gives

$$\begin{aligned} V(x+u,y)\ge \frac{1}{2}[(M^{-1}V(x,y)-1)+1]=\frac{1}{2M}V(x,y),\quad \forall |u|\le 1. \end{aligned}$$

Therefore, consider any \(C^1\) path \(\theta \) that connects x with \(x'\) (recall \(|x-x'|\le \tfrac{1}{2}\)). If \(\theta \) lies entirely in \(B_x(1)=\{x+u: |u|\le 1\}\), then

$$\begin{aligned} \int ^1_0 V(\theta (s),y)|\dot{\theta }(s)|{\text {d}}s\ge \frac{1}{2M}V(z)\int ^1_0 |\dot{\theta }(s)|{\text {d}}s\ge \frac{1}{2M}V(z)|x-x'|. \end{aligned}$$

In the other case, if \(\theta \) has a part lying outside \(B_x(1)\), then there is an exit time from \(B_x(1)\), \(\tau =\inf \{s: \theta (s)\notin B_x(1)\}\). By definition, \(\theta ([0,\tau ])\) is a \(C^1\) path of length at least 1, hence

$$\begin{aligned} \int ^1_0 V(\theta (s),y)|\dot{\theta }(s)|{\text {d}}s\ge \int ^\tau _0 V(\theta (s),y)|\dot{\theta }(s)|{\text {d}}s \ge \frac{1}{2M}V(z)\ge \frac{1}{2M}V(z)|x-x'|. \end{aligned}$$

For the other side of the bound, we only need to verify it for the straight path \(\theta (s)=x+s(x'-x)\); by polynomial growth, and since \(M(V(z)+1)\le 2MV(z)\) when \(V\ge 1\),

$$\begin{aligned} |x-x'|\int ^1_0 V(\theta (s),y){\text {d}}s\le M|x-x'|(V(z)+1). \end{aligned}$$

\(\square \)

The second lemma gives a bound on the perturbation of measures caused by a perturbation of the initial condition:

Lemma 4.6

Under the conditions of Theorem 3.5, for any fixed T there is a constant \(D_T\) such that

$$\begin{aligned} \int P_W(\hbox {d}\omega ) \sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t}\Vert D_x p^{z,\omega , T}_{n, \mathbf {t},\mathbf {y}}\Vert \le D_T(V(z)+1). \end{aligned}$$

Proof

Recall that the density function is

$$\begin{aligned} p^{z,\omega ,T}_{n,\mathbf {t},\mathbf {y}}=1_{t_1<\cdots<t_n<T}\exp \bigg (-\int ^{T}_0 \bar{\lambda }(z_s){\text {d}}s\bigg )\prod _{i=1}^n(\lambda (z_{t_i-},y_{t_i}) \hbox {d}t_i). \end{aligned}$$

Taking the Fréchet derivative and applying the chain rule, we have

$$\begin{aligned}&\int P_W(\text {d}\omega ) \sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t} \Vert D_x p^{z,\omega ,T}_{n,\mathbf {t},\mathbf {y}} \Vert \nonumber \\&\quad \le \int P_W(\text {d}\omega )\sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t} p^{z,\omega ,T}_{n,\mathbf {t},\mathbf {y}}\int ^T_0\Vert D_x \Psi _\omega (x,\mathbf {t},\mathbf {y},s)\partial _x \bar{\lambda }(x_s,y_s)\Vert {\text {d}}s\nonumber \\&\qquad +\int P_W(\text {d}\omega )\sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t} p^{z,\omega ,T}_{n,\mathbf {t},\mathbf {y}}\sum _{k=1}^n\Vert D_x \Psi _\omega (x,\mathbf {t},\mathbf {y},t_k)\Vert \frac{\Vert \partial _x \lambda (z_{t_k-},y_{t_k})\Vert }{\lambda (z_{t_{k}-},y_{t_k})}. \end{aligned}$$
(4.8)

We will bound the two parts separately below. Since for \(t_k\le s<t_{k+1}\) and every \(\omega \),

$$\begin{aligned} \Vert D_x\Psi _\omega (x,\mathbf {t},\mathbf {y},s)\Vert= & {} \Vert D_x\Psi ^{y_k,\omega }_{t_k,s}\circ \Psi ^{y_{k-1},\omega }_{t_{k-1},t_{k}}\circ \cdots \circ \Psi ^{y_0,\omega }_{0,t_1}x_0\Vert \\\le & {} \exp \bigg (-\int ^s_0 \rho (y_{r}){\text {d}}r\bigg )\le \exp (M_\rho T). \end{aligned}$$

Here \(M_\rho :=\max _y\{-\rho (y)\}<\infty \). Using condition (2) of Theorem 3.5, the first part of (4.8) is bounded by the following

$$\begin{aligned}&M\int P_W(\text {d}\omega )\sum _{n,\mathbf {y}}\int \hbox {d}\mathbf {t} p^{z,\omega ,T}_{n,\mathbf {t},\mathbf {y}}\int ^T_0 \exp (M_\rho T)V(z_s){\text {d}}s \\&\qquad = M\exp (M_\rho T) \mathbb {E}^z\int ^T_0 V(Z_s){\text {d}}s, \end{aligned}$$

The equality above holds once we recognize the probabilistic meaning of the integrals, and the right-hand side is bounded by \(K_T V(z)\) for a proper \(K_T\), since V is a Lyapunov function. For the second part of (4.8), according to condition (2) of Theorem 3.5, it is clearly bounded by

$$\begin{aligned}&M\exp (M_\rho T)\int P_W(\text {d}\omega )\sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t} p^{z,\omega ,T}_{n,\mathbf {t},\mathbf {y}}\sum _{k=1}^n \frac{V(z_{t_k-})}{\lambda (z_{t_{k}-},y_{t_k})} \\&\quad =M\exp (M_\rho T)\mathbb {E}^z \sum _{k:\tau _k\le T} \frac{V(Z_{\tau _k-})}{\lambda (Z_{\tau _k-},Y_{\tau _k})}, \end{aligned}$$

where the \(\tau _k\) are the sequential jump times. Applying formula (31.18) in [10] with \(b(z',z)=\frac{V(z)}{\lambda (z,y')}\) to the quenched PDMP, and then annealing, we find

$$\begin{aligned} \mathbb {E}^z \sum _{k:\tau _k\le T} \frac{V(Z_{\tau _k-})}{\lambda (Z_{\tau _k-},Y_{\tau _k})}=\mathbb {E}^z\int ^T_0 V(Z_t)\hbox {d}t, \end{aligned}$$

which is bounded further by \(K_T V(z)\) with a proper \(K_T\) since V is a Lyapunov function.

\(\square \)

We are finally in a position to prove Proposition 4.4:

Proof of Proposition 4.4

By the definition of a contracting metric, it suffices to show that \(\hbox {d}(P^*_T\delta _z,P^*_T\delta _{z'})\le \frac{1}{2}\hbox {d}(z,z')\) when \(\hbox {d}(z,z')<1\), which implies \(y=y'\) and \(|x-x'|\le \frac{1}{2}\).

Since the spaces here are Polish, by the Kantorovich–Rubinstein theorem (Theorem 11.8.2 of [13]),

$$\begin{aligned} \hbox {d}(P^*_T\delta _z,P^*_T\delta _{z'})=\sup _\varphi \bigg \{P_T^{x,y} \varphi -P_T^{x',y} \varphi \bigg |\Vert \varphi \Vert _{Lip(d)}\le 1\bigg \}. \end{aligned}$$

Here \(\Vert \varphi \Vert _{Lip(d)}\) denotes the Lipschitz norm; in other words, for any \(z,z'\in E\), \(\varphi (z)-\varphi (z')\le \Vert \varphi \Vert _{Lip(d)} d(z,z')\). Hence, to prove the proposition we only need to show that for any \(\varphi \) with \(\Vert \varphi \Vert _{Lip(d)}\le 1\),

$$\begin{aligned} P_T^{x,y}\varphi -P_T^{x',y} \varphi \le \frac{1}{2\delta }\bigg (\inf _{\theta :x\rightarrow x'}\int ^1_0 V(\theta (s),y)|\dot{\theta }(s)|{\text {d}}s\bigg )^r. \end{aligned}$$
(4.9)

However, if \(\varphi \) has d-Lipschitz norm less than 1, its maximum variation is less than 1, so we can replace \(\varphi \) by \(\varphi -c\) such that \(\Vert \varphi \Vert _\infty \le \frac{1}{2}\), while \(P_T^{x,y}\varphi -P_T^{x',y}\varphi \) remains unchanged. So without loss of generality, we assume \(\Vert \varphi \Vert _\infty \le \frac{1}{2}\).

Consider splitting \(P_T^{x,y}\varphi -P_T^{x',y}\varphi \) into the difference caused by the initial condition and the difference caused by the underlying probability measure:

$$\begin{aligned} |P_T^{x,y}\varphi -P_T^{x',y} \varphi |&= |\mathbb {E}^{x,y}\varphi (\Psi _\omega (x,Y_{s\le T},T), Y_T)-\mathbb {E}^{x',y}\varphi (\Psi _\omega (x',Y_{s\le T},T), Y_T)|\nonumber \\&\le |\mathbb {E}^{x,y}\varphi (\Psi _\omega (x,Y_{s\le T},T), Y_T)-\mathbb {E}^{x,y}\varphi (\Psi _\omega (x',Y_{s\le T},T), Y_T)| \end{aligned}$$
(4.10)
$$\begin{aligned}&\quad +|\mathbb {E}^{x,y}\varphi (\Psi _\omega (x',Y_{s\le T},T), Y_T)-\mathbb {E}^{x',y}\varphi (\Psi _\omega (x',Y_{s\le T},T), Y_T)|.\qquad \end{aligned}$$
(4.11)

The quantity \(\mathbb {E}^{x,y}\varphi (\Psi _\omega (x',Y_{s\le T},T), Y_T)\) has its probability law initialized from the point (x, y), while the stochastic flow is initialized at \(x'\); in other words:

$$\begin{aligned} \mathbb {E}^{x,y}\varphi (\Psi _\omega (x',Y_{s\le T},T), Y_T) =\int P_W(\text {d}\omega ) \sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t}\, p^{z,\omega , T}_{n, \mathbf {t},\mathbf {y}} \varphi (\Psi _\omega (x',\mathbf {t},\mathbf {y},T),y_T). \end{aligned}$$

Since \(\varphi \) is d-Lipschitz, the first part is bounded as follows by Lemma 4.5:

$$\begin{aligned} (4.10)&\le \mathbb {E}^{z}d((\Psi _\omega (x,Y_{s\le T},T),Y_T),(\Psi _\omega (x',Y_{s\le T},T),Y_T))\\&\le \mathbb {E}^{z} 1\wedge \delta ^{-1}\bigg (\inf _{\theta }\int ^1_0 V(\theta (s),Y_T)\Vert \dot{\theta }(s)\Vert {\text {d}}s \bigg )^r\\&\le \mathbb {E}^{z} \delta ^{-1}(2M)^r V(Z_T)^r|u_T|^r, \end{aligned}$$

where \(u_T=\Psi _\omega (x',Y_{s\le T}, T)-X_T\). By the Cauchy–Schwarz inequality,

$$\begin{aligned} \mathbb {E}^{z}V(Z_T)^r|u_T|^r\le \sqrt{\mathbb {E}^z V(Z_T)^{2r}}\sqrt{\mathbb {E}^z |u_T|^{2r}}. \end{aligned}$$

Notice that \(\mathbb {E}^z |u_T|^{2r}\le [\mathbb {E}^z |u_T|^{m}]^{\frac{2r}{m}}\) by Jensen’s inequality, and by contraction on average:

$$\begin{aligned} \mathbb {E}|u_T|^m \le \mathbb {E}^z|x-x'|^m\exp \bigg (-\int ^T_0m\rho (Y_s){\text {d}}s\bigg )\le C_\rho |x-x'|^m\exp (-\bar{\rho } T). \end{aligned}$$

Likewise, \(\mathbb {E}^z |V(Z_T)|^{2r}\le [\mathbb {E}^z V(Z_T)]^{2r}\le C_0V(z)^{2r}\) for a constant \(C_0\). As a consequence, there is a constant \(C_1\) such that

$$\begin{aligned} (4.10)\le C_1 \exp \big (-\tfrac{2r}{m}\bar{\rho }T\big ) |x-x'|^{r} V(z)^r, \end{aligned}$$

which further leads to the following by Lemma 4.5 for a constant \(C_2\):

$$\begin{aligned} (4.10)\le \delta ^{-1}C_2 \exp \big (-\tfrac{2r}{m}\bar{\rho }T\big ) \bigg (\inf _{\theta :x\rightarrow x'}\int ^1_0 V(\theta (s),y)\Vert \dot{\theta }(s)\Vert {\text {d}}s\bigg )^r. \end{aligned}$$

In the following, we fix a T such that \(C_2\exp \big (-\tfrac{2r}{m}\bar{\rho }T\big )\le \frac{1}{4}\). To bound (4.11), note that by definition:

$$\begin{aligned} (4.11)&= \bigg |\int P_W(\text {d}\omega ) \sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t}\,(p^{z,\omega , T}_{n, \mathbf {t},\mathbf {y}}-p^{z',\omega , T}_{n, \mathbf {t},\mathbf {y}}) \varphi \big (\Psi _\omega (x',\mathbf {t},\mathbf {y},T),y_T\big )\bigg |\\&\le \Vert \varphi \Vert _\infty \int P_W(\text {d}\omega ) \sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t}\big |p^{z,\omega , T}_{n, \mathbf {t},\mathbf {y}}-p^{z',\omega , T}_{n, \mathbf {t},\mathbf {y}}\big |\\&\le \frac{1}{2} \int P_W(\text {d}\omega ) \sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t} \int ^1_0 \Vert D_x p^{z_s,\omega , T}_{n, \mathbf {t},\mathbf {y}}\Vert |\dot{\theta }(s)|{\text {d}}s, \end{aligned}$$

where \(z'=(x',y)\), \(z_s=(\theta (s),y)\), and \(\theta \) is any \(C^1\) path from x to \(x'\). By Lemma 4.6, there is a constant \(D_T\) such that

$$\begin{aligned} \int P_W(\text {d}\omega ) \sum _{n,\mathbf {y}}\int _{[0,T]^n} \hbox {d}\mathbf {t}\Vert D_x p^{z_s,\omega , T}_{n, \mathbf {t},\mathbf {y}}\Vert \le D_TV(z_s). \end{aligned}$$

Hence, for any \(C^1\) path \(\theta \) connecting x and \(x'\) with \(\int ^1_0 V(z_s)|\dot{\theta }(s)|{\text {d}}s\le 1\), because \(r<1\),

$$\begin{aligned} (4.11)\le \frac{D_T}{2}\int ^1_0 V(z_s)|\dot{\theta }(s)|{\text {d}}s\le \frac{D_T}{2}\bigg (\int ^1_0 V(z_s)|\dot{\theta }(s)|{\text {d}}s\bigg )^r. \end{aligned}$$

So combining the bounds for (4.10) and (4.11), we have

$$\begin{aligned} \mathbb {E}^{x,y}\varphi (Z_T)-\mathbb {E}^{x',y}\varphi (Z_T)\le \bigg (\frac{1}{4\delta }+\frac{D_T}{2}\bigg ) \bigg (\int ^1_0 V(z_s)|\dot{\theta }(s)|{\text {d}}s\bigg )^r; \end{aligned}$$

it suffices to take \(\delta \le (2D_T)^{-1}\) in (4.9) for this proposition to hold. \(\square \)

4.4.2 Small set verification

Verification of condition (2) of Theorem 4.3 is given by the following lemma.

Lemma 4.7

Under the conditions of Theorem 3.5, for any fixed \(C>0\), there are some \(T,\epsilon >0\) such that:

$$\begin{aligned} \hbox {d}(P^*_T\delta _z ,P^*_T\delta _{z'})\le 1-\epsilon , \quad \forall z,z':V(z),V(z')\le C. \end{aligned}$$

Proof

The proof is essentially the same as that of Theorem 3.4: a coupling is constructed in two steps. We first couple \(Y_{t_0}\) and \(Y'_{t_0}\) to \(y_c\), following exactly the same common-accessibility argument as in the proof of Theorem 3.4. Then we keep the values of Y identical afterward, until the contracting dynamics brings the X parts close enough. For this part, it suffices to show that there is a \(t_1\) such that for any two x and \(x'\) with \(V(x,y_c), V(x',y_c)\le C\), there is a coupling \(Q_W\) of \(P_W(\text {d}\omega )\) and \(P_W(\text {d}\omega ')\) such that

$$\begin{aligned} \int Q_W(\text {d}\omega , \text {d}\omega ')\hbox {d}((\Psi ^{y_c, \omega }_{t_1} x, y_c),(\Psi ^{y_c,\omega '}_{t_1} x',y_c))\le 1-\epsilon . \end{aligned}$$
(4.12)

Then (4.12) replaces (4.6), and the remainder of the proof of Theorem 3.4 applies again.

By Lemma 4.5, for all \(\omega \),

$$\begin{aligned} d((\Psi ^{y_c, \omega }_t x, y_c),(\Psi ^{y_c,\omega }_t x',y_c)) \le \delta ^{-1}(2M)^{r}\exp (-r\rho (y_c)t)V((\Psi ^{y_c, \omega }_t x),y_c)^r|x-x'|^r. \end{aligned}$$

Note that \(V(x,y_c), V(x',y_c)\le C\) implies \(|x-x'|\) is bounded, so for a t large enough,

$$\begin{aligned} d((\Psi ^{y_c, \omega }_t x, y_c),(\Psi ^{y_c,\omega }_t x',y_c))\le & {} \frac{\exp \big (-\frac{1}{2}r\rho (y_c)t\big )}{2C} V((\Psi ^{y_c, \omega }_t x),y_c)^r\\\le & {} \frac{\exp \big (-\frac{1}{2}r\rho (y_c)t\big )}{2C} V((\Psi ^{y_c, \omega }_t x),y_c). \end{aligned}$$

Therefore, if we let \(\omega =\omega '\), that is, \(Q_W(\text {d}\omega , \text {d}\omega ')=P_W(\text {d}\omega )\delta _{\omega '=\omega }\), the left-hand side of (4.12) is bounded, using (3.1), by

$$\begin{aligned} \frac{\exp \big (-\tfrac{1}{2}r\rho (y_c)t\big )}{2C} \int P_W(\text {d}\omega )V((\Psi ^{y_c, \omega }_t x),y_c) \le \exp \big (-\tfrac{1}{2}r\rho (y_c)t\big )K_t. \end{aligned}$$

Since \(K_t\) has at most polynomial growth, a large \(t=t_1\) would produce (4.12). \(\square \)
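The synchronous coupling \(\omega =\omega '\) used above is easy to visualize numerically: driving two copies of a contracting SDE with the same Wiener increments makes their difference decay deterministically. A minimal sketch with the hypothetical linear drift \(-x\) (so the contraction rate is constant, \(\rho \equiv 1\)):

```python
import numpy as np

rng = np.random.default_rng(2)

def synchronous_pair(x0, x0p, T=5.0, dt=1e-3):
    """Run two copies of dX = -X dt + dW with the SAME Wiener increments
    (the choice omega' = omega in the proof); for this linear drift the
    difference contracts deterministically like exp(-t)."""
    n = int(T / dt)
    x, xp = x0, x0p
    for _ in range(n):
        dw = np.sqrt(dt) * rng.standard_normal()
        x += -x * dt + dw
        xp += -xp * dt + dw
    return x, xp

x, xp = synchronous_pair(3.0, -2.0)
print(abs(x - xp))  # ~ 5 * exp(-5) ~ 0.034, independent of the noise path
```

Because the noise cancels exactly in the difference, the decay holds for every realization, which is what makes the simple choice \(Q_W(\text {d}\omega , \text {d}\omega ')=P_W(\text {d}\omega )\delta _{\omega '=\omega }\) sufficient here.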

4.4.3 Proof of Theorem 3.5

With the conditions of Theorem 4.3 verified, it is rather elementary to show Theorem 3.5.

Proof

By the existence of a Lyapunov function and the Krylov–Bogoliubov theorem, there exists an invariant measure \(\pi \) with \(\mathbb {E}^\pi V= \int \pi (\hbox {d}z)V(z)<\infty \). Let d be as in Proposition 4.4 and \(\tilde{d}\) be as defined in Theorem 4.3; then

$$\begin{aligned} \tilde{d}(P^*_{nt}\delta _z, P^*_{nt}\pi )\le \frac{1}{2}\tilde{d}(P^*_{(n-1)t}\delta _z,P^*_{(n-1)t}\pi )\le \cdots \le \frac{1}{2^n}\tilde{d}(\delta _z,\pi ). \end{aligned}$$

By Lemma 4.5, for some constant \(C_0\)

$$\begin{aligned} \tilde{d}(\delta _z,\pi )&\le \int \pi (dz') \sqrt{(1+V(z)+V(z'))(2M)^r\delta ^{-1}|x-x'|^rV(z)^r} \\&\le C_0+C_0 V(z)^{\frac{r+1}{2}}\int \pi (dz')|x-x'|^{\frac{r}{2}}+C_0 V(z)^{\frac{r}{2}}\int \pi (dz')|x-x'|^{\frac{r}{2}}V(z')^{\frac{1}{2}}. \end{aligned}$$

Since \(V(z')\ge M^{-1}|x'|^\frac{1}{n}\), using the bound above, we can pick \(r<\frac{2}{n}\) at the beginning such that

$$\begin{aligned} \tilde{d}(\delta _z,\pi )\le C_0(1+2V(z)\mathbb {E}^\pi V). \end{aligned}$$

On the other hand, by Lemma 4.5 and the fact that \(1\wedge u^r\ge 1\wedge u\) for any \(u\ge 0\),

$$\begin{aligned} \tilde{d}(z,z')\ge 1_{y=y'}\wedge \frac{|x-x'|^r}{(2M)^r}V(z)^r+1_{y\ne y'} \ge 1_{y=y'}\wedge \frac{|x-x'|}{2M}+1_{y\ne y'}\ge \frac{1}{2M} m(z,z'). \end{aligned}$$

Combining the inequalities above, we have our claim. \(\square \)

5 Conclusion and discussion

Diffusions with random switching are a general class of stochastic processes with applications in many different areas. Such a system consists of a diffusion component \(X_t\) and a Markov jump component \(Y_t\), whose dynamics can be fully coupled. The stability and ergodicity of such processes remained open questions when the transition rates are not bounded or Lipschitz. This paper closes this gap by developing a new analytical framework.

The first part of this paper constructs polynomial-type Lyapunov functions when there is dissipation on average. These functions can be used to derive moment bounds on the diffusion part. The incorporation of the potential dissipation of each regime is found to be an efficient way to capture averaged dissipation. This idea can easily be applied to the classical case where the transition rates are constants, as in Theorem 2.2. It also leads to a simple illustration of comparison principles, Theorem 2.9. Moreover, with a Fredholm alternative argument, we demonstrate how the Lyapunov function can be inductively constructed as a dual of the averaging procedure, Theorems 2.4 and 2.7, assuming the transition rates have a multiscale structure.

The second part of this paper is devoted to the geometric ergodicity of diffusions with random switching, assuming a Lyapunov function exists. If there is a commonly accessible regime that satisfies the minorization condition, Theorem 3.4 proves geometric ergodicity under the total variation distance. When there is contraction on average, using the asymptotic coupling framework of [18, 19], Theorem 3.5 demonstrates geometric ergodicity under a proper Wasserstein distance.

A few interesting ideas occurred to the authors while this paper was written. We share them below to inspire further research.

  1. 1.

The authors conjecture that the results here hold in a similar form if the regime process \(Y_t\) is instead a continuous stochastic process. One way to see this is to take the limit of jump processes on grids of vanishing mesh size. But the authors suspect that an independent mechanism can be set up for these processes with little change to the proofs. Such a theory would be applicable to many nonlinear models that exhibit intermittency, for example [34].

  2. 2.

The attraction or contraction rates used in this paper provide uniform control over each component of \(X_t\). A more general situation is that each component of \(X_t\) has a different regime-based attraction rate; in other words, the attraction rate is given by a matrix. A simple example would be \(\hbox {d}X_t=A(Y_t)X_t \hbox {d}t\), where A is a matrix-valued function. It is known that such a system is very sensitive to the switching even if A has all eigenvalues with negative real parts [28]. It would be interesting to see whether our results can be generalized to this case.

  3. 3.

Theorem 3.4 can probably be generalized. Bakhtin and Hurth [1] show that a PDMP is geometrically ergodic in total variation even though each regime is degenerate. The proofs there only require a Hörmander-type condition on the vector fields generated by the different regimes. The authors conjecture that for diffusions with random regime switching, in order to have geometric ergodicity, it suffices to have the Lie algebra generated by the stochastic flows of all regimes span \(\mathcal {R}^d\). But this requires a completely different set of techniques, and Assumption 3.2 should be general enough to cover most applications.
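The sensitivity mentioned in item 2 can be illustrated numerically: under greedy worst-case switching, the norm can grow even though every individual matrix is Hurwitz. The matrices and the switching rule below are hypothetical stand-ins for \(A(Y_t)\) chosen for illustration, not taken from [28]:

```python
import numpy as np

# Two Hurwitz matrices (each has eigenvalues -1 +- i*sqrt(1000));
# hypothetical stand-ins for A(y) in the example dX = A(Y_t) X dt.
A = {1: np.array([[-1.0, 10.0], [-100.0, -1.0]]),
     -1: np.array([[-1.0, 100.0], [-10.0, -1.0]])}

def worst_case_switching(x0, T=1.0, dt=1e-4):
    """Greedy destabilizing switching: at each step pick the regime that
    maximizes the instantaneous growth d|x|^2/dt = x^T (A + A^T) x."""
    x = np.array(x0, dtype=float)
    for _ in range(int(T / dt)):
        y = max(A, key=lambda k: x @ (A[k] + A[k].T) @ x)
        x = x + dt * A[y] @ x  # explicit Euler step within regime y
    return np.linalg.norm(x)

print(worst_case_switching([1.0, 0.0]))  # grows by many orders of magnitude
```

This is exactly why matrix-valued attraction rates cannot be handled by the scalar rate \(\rho (y)\) used in this paper: stability then depends on the switching path, not just on each regime separately.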