Random Forests and Networks Analysis

D. Wilson~\cite{[Wi]} in the 1990's described a simple and efficient algorithm based on loop-erased random walks to sample uniform spanning trees and more generally weighted trees or forests spanning a given graph. This algorithm provides a powerful tool in analyzing structures on networks and along this line of thinking, in recent works~\cite{AG1,AG2,ACGM1,ACGM2} we focused on applications of spanning rooted forests on finite graphs. The resulting main conclusions are reviewed in this paper by collecting related theorems, algorithms, heuristics and numerical experiments. A first foundational part on determinantal structures and efficient sampling procedures is followed by four main applications: 1) a random-walk-based notion of well-distributed points in a graph 2) how to describe metastable dynamics in finite settings by means of Markov intertwining dualities 3) coarse graining schemes for networks and associated processes 4) wavelets-like pyramidal algorithms for graph signals.


Introduction: networks, trees and forests
The aim of this paper is to survey some recent results [4,5,2,3] on a certain measure on spanning forests of a given graph and its applications within the context of networks analysis. We call a network on n ∈ N vertices a directed and weighted graph where V denotes a finite vertex set of size |V| = n, E stands for a directed edge set seen as a prescribed collection of ordered pairs of vertices {(x, y) ∈ V × V}, and w : V × V → R + is a weight function, which associates to each ordered pair (x, y) ∈ E a strictly positive weight w(x, y). We will consider irreducible networks where for every two distinct vertices x, y ∈ V, there is always a directed path connecting them, that is, a sequence {e i = (x i , y i )} l i=1 ⊂ E for some l ∈ N such that x 1 = x, y l = y and y i = x i+1 for every i ≤ l − 1. Let us introduce the measure at the core of this work. A rooted spanning forest φ is a subgraph of G without cycle, with V as set of vertices and such that, for each x ∈ V, there is at most one y ∈ V such that (x, y) is an edge of φ. The root set R(φ) of the forest φ is the set of points x ∈ V for which there is no edge (x, y) in φ; the connected components of φ are trees, each of them having edges that are oriented towards its own root. We call F the set of all rooted spanning forests and we see each element φ in F as a subset of E. See Figure 1. For fixed positive q ∈ R + , we are interested in the random forest Φ q defined as the random variable with values in F with law: where w(φ) = e∈φ w(e) is the weight associated to the forest φ ∈ F, |R(φ)| is the number of trees, which is also the number of roots, and Z(q) = φ∈F w(φ)q |R(φ)| is the normalizing partition function. In particular ∅ ∈ F is the spanning forest made of n degenerate trees reduced to simple roots and w(∅) = 1. We can include the case q = +∞ in our definition by setting Φ ∞ = ∅ ∈ F in a deterministic way. In the sequel we denote by E expectation w.r.t. the random forest law P.
Let us notice that the random forest Φ q induces a partition of the graph into trees, and hence the measure in (1) can be seen on the one hand as a clustering measure similar in spirit to the well-known FK-percolation [20]. On the other hand, the forest Φ q is rooted and the set of roots R(Φ q ) forms an interesting random subset of vertices whose distribution can be explicitly characterized. Figure 1 and 2, respectively, show a realization of Φ q in a simple geometrical setting and how the associated partition denoted by P(Φ q ) and set of roots R(Φ q ) look like. As we will show, the presence of the tuning parameter q, controlling size and number of trees, and related efficient sampling algorithms make this measure particularly flexible and suitable for applications.
1.1. Content of the paper. This survey is organized in two parts. The first three sections constitute a first foundational part, followed by a second part on applications presented in the remaining sections. We will start by presenting basic properties of the measure in (1), section 2, and some sampling algorithmic counterparts, section 3. We then move to three main applications: (I) In section 4, we will show how the set of roots R(Φ q ) can be used to define a probabilistic notion of subset of well distributed points in a graph and to practically sample it. (II) As a second application, a network coarsening scheme is presented in section 5, based on the forest Φ q and the notion of Markov chain intertwining. Motivations stem from questions in metastability and in signal processing and we provide two different algorithms and related theorems to control the quality of the resulting coarse grained network. (III) As a third application, supported by theorems and related experiments, in section 6 we propose a wavelet basis construction and a multiscale algorithm for signal processing on arbitrary networks.

Figure 2.
A sample of P(Φ q ) and R(Φ q ) on a 2-dimensional box of Z 2 with constant unitary nearest-neigbour weights and periodic boundary conditions. We used different shades of blue for blocks in the partition identified by different trees. Leaves are cyan, and roots are at the centers of red diamonds.
To conclude this introductory part, we describe briefly the origin of this measure and related literature, section 1.2, and introduce some basic crucial objects and notation needed along the paper, section 1.3. Apart from Wilson's algorithm, Theorem 1, Propositions 8 and 13, and parts of Theorems 2 and 11, all other statements, algorithms and experiments are original and they were recently derived by the authors.
1.2. Uniform spanning tree and a zoo of random combinatorial models. The Uniform Spanning Tree (UST) constitutes a by now classical topic in probability theory and it consists of a random spanning tree sampled uniformly among all possible spanning trees of a given graph. The analysis of this object can be traced back at least to the work of Kirchhoff [25] where the number of spanning trees of a graph was characterized in terms of determinants of minors of the corresponding discrete Laplacian matrix (matrix-tree theorem). In the last decades, UST has been playing a central role in probability and statistical mechanics due to its deep relation with Markov chain theory and its surprising connection to a number of challenging random combinatorial objects of current interest: e.g. loop erased random walks, percolation, dimers, sandpile models, Gaussian fields. We refer to [6,20,26,28,27,22,8,17,24,23] for an overview on the vast literature on the subject. What makes this object particularly interesting is its determinantal nature, namely, related local statistics have a closed-form expression in terms of determinants of certain kernels, together with the fact that there's an efficient random-walk-based-algorithm due to Wilson [42] for sampling, which will be presented in section 3.1. The forest measure in (1) is actually a simple variant of the UST measure and it appeared already as a remark in [42] where the author mentions how to sample it. As for the UST measure, there are many interesting questions related to scaling and infinite volume limits of observables of the forest measure in (1). Nonetheless, our focus here is on applications within the context of networks analysis and for this reason we will not insist on this very interesting fundamental line of investigation and we will restrict only to networks with finite number of vertices.
1.3. Basic objects and notation: random walks, graph Laplacian, Green's function. Given the finite, irreducible, directed and weighted graph G = (V, E, w) on n vertices, let X = {X(t) : t ≥ 0} be the irreducible continuous-time Markov process with state space V and generator L given by In view of the finiteness and irreducibility assumptions, there exists a unique invariant measure for the Markov process X which will be denoted by µ. Averages of functions w.r.t. µ will be denoted by µ(f ). We recall that the invariance of µ is equivalent to µ(Lf ) = 0 for arbitrary functions f . We will denote by P x and E x , respectively, law and expectation w.r.t. the random walk X starting from x ∈ V. Note that L acts on functions as the matrix, still denoted by L: We refer to the operator −L as the weighted graph Laplacian and we set We will denote byX the discrete-time skeleton chain associated to L defined as the Markov chain with state space V and transition matrix with Id V being the identity matrix of size |V|.
In the sequel we deal with restrictions of various matrices: for any A, B ⊂ V and any matrix In case A = B, we will simply write [M ] A . For a subset A ⊂ V T A = inf{t ≥ 0 : X(t) ∈ A} is the hitting time of the set A, with the convention T ∅ = ∞. Finally, for a given (possibly random) time T and arbitrary x, y ∈ V, we write : i.e. the mean time X spends in y up to time T when starting from x. In case T is a random variable independent of the random walk, we will slightly abuse notation and still use E x for expectation w.r.t. the random walk and the extra randomness. Let us conclude with basic notation for normed spaces. We will denote by Further, for arbitrary probability measures ν 1 and ν 2 on V, their total variation distance is given by 2. Random spanning forests: Laplacian spectrum and determinantality We start here an account of the basic fundamental results characterizing the distribution of the main objects related to the random forest Φ q . It will be convenient for the sequel to consider the following generalized version of the random forest. For any B ⊂ V we denote by Φ q,B a random variable in F with the law of Φ q conditioned on the event B ⊂ R(Φ q ) . We then have, for any φ in F, with The original definition in Equation (1) is recovered by simply setting B = ∅, so that Φ q = Φ q,∅ and Z(q) = Z ∅ (q). This extended law is non-degenerate even for q = 0, provided that B is non-empty. And if B is a singleton {r}, then Φ 0,{r} is the classical random spanning tree with a given root r, namely, a spanning tree τ rooted at r sampled with probability proportional to e∈τ w(e). Let us emphasize that actually, for q > 0, Φ q = Φ q,∅ itself is also a special case of the usual random spanning tree on the extended weighted graph G = (V , E , w ) obtained by addition of an extra point r to V -to form V = V ∪ {r}-and by adding extra edges to r with weights w (x, r) = q and w (r, x) = 0 for all x in V. Indeed, to get Φ q from the random spanning tree on V rooted in r, one only needs to remove all the edges going from V to r.
As a first result we characterize the partition function in terms of the Laplacian spectrum. To this end we denote by λ 0,B , λ 1,B , . . . , λ l−1,B , with l = |V \ B| and some B ⊂ V, the eigenvalues of [−L] V\B .
Theorem 1 (Partition function and Laplacian spectrum). For any B ⊂ V, Z B is the characteristic polynomial of [L] V\B , i.e., The above result can be seen as a version of the well-known matrix-tree-theorem (see for instance [1]). The proof can be derived in several ways. We refer to [4] for an elementary proof by a classical argument based on loop-erased random walks and using the notation herein.
As mentioned above, one of the nice features of the forest measure (as well as for the random spanning tree measure) is its determinantal structure which allows for explicit computations. We start by recalling the determinantality of the edge set for which we need some more notation.
For an oriented edge e = (x, y) we denote the starting and ending vertex, respectively, as e − = x and e + = y. For any given B ⊂ V and q > 0, we write G q,B for the Green's function in Equation (7) with T = T q ∧ T B , the minimum between the hitting time of B (see Equation (6)) and time T q denoting an independent exponential time of parameter q. It is not difficult to see that G q,B can be identified with the operator [qId − L] −1 V\B so that, For x ∈ V and e in E, we call J + q,B (x, e) = G q,B (x, e − )w(e) the expected number of crossings of the (oriented) edge e up to time T q ∧ T B , and , the net flow through e starting from x. Theorem 2 (Determinantal edges: transfer-current). Fix B ⊂ V and q > 0. Then, for any A k = {e 1 , . . . , e k } ⊂ E : In addition, denoting by {±e 1 , . . . , ±e k ∈ Φ q,B } the event that for all i ≤ k either e i or −e i belong to Φ q,B , it holds with The above theorem is a version of the celebrated transfer-current theorem due to Burton and Pemantle [8]. In its original form, this theorem was proven in an undirected graph, extensions to the directed setup have appeared for e.g. in the recent Chang [10] and the statement in Theorem 2 is nothing but a probabilistic reformulation of Thm. 5.2.3 and Coro. 5.2.4 in [10]. For a simple proof using our notation, we refer the reader to Prop. 3.1 in [5].
Theorem 2 says that Φ q,B is a determinantal process with kernel I + q,B interpretable in terms of random-walk-flow. If from a computational point of view, this allows to get explicit formulas, from a phenomenological perspective, being determinantal means that the corresponding objects tend to repel each other, more precisely, they are negatively correlated: for any e 1 , e 2 ∈ E. Inherited by the determinantal nature of Φ q,B , also the set of roots R(Φ q,B ) is determinantal with a remarkable stochastic kernel given by the random walk X killed at time T q ∧ T B : Theorem 3 (Determinantal roots with killed random walk kernel). Fix B ⊂ V and q > 0. Then, for any A ⊂ V : K q,B (x, y) := qG q,B (x, y) = P x X(T q ∧ T B ) = y , x, y ∈ V . In case B = ∅, Φ q,∅ = Φ q and we simply write This theorem has been derived in [4], see Prop. 2.2 therein. We next move to the characterization of |R(Φ q,B )|, that is, the number of roots/connected components/trees. The next statement corresponds to Prop. 2.1 in [4]. Decompose and define independent random variables B j 's and C j 's, respectively, with laws: , j ∈ J 0 , and for j ∈ J + , P(C j = 2) = |p j (q)| 2 , P(C j = 1) = 2Re (p j (q))−2|p j (q)| 2 , P(C j = 0) = 1−2Re (p j (q))+|p j (q)| 2 .
Then, whenever q > 0 or B = ∅, the random variable |R(Φ q,B )| is distributed as Notice that in case the spectrum of the graph Laplacian −L is real and B is empty, |R(Φ q )| is simply given by the sum of independent Bernoulli's B j 's (J + being empty). In particular, since λ 0 = 0, B 0 = 1 and we recover the fact that |R(Φ q )| ≥ 1 a.s.
Further, we emphasize that in general momenta of the |R(Φ q )| have simple expressions and can be easily obtained by differentiating w.r.t. q the normalizing partition function Z(q). For example, mean and variance are given by 2.1. Dynamics: forests, roots and partitions. Before moving to sampling algorithms, it is worth mentioning that it is possible to construct a stochastic process with values in F which allows to couple at once all Φ q 's as q varies in R + . A few comments on this coupling are postponed to section 3.3 and Figure 5. We state here the main theoretical results and collect some related remarks. The following statement corresponds to Thm. 2 in [4].
The coupling t → Φ 1/t = F (ln(1 + w max t)) is associated with a fragmentation and coalescence process, for which coalescence is strongly predominant, and at each jump time one component of the partition is fragmented into pieces that possibly coalesce with the other components. In particular, the process F starts from F (0) = Φ ∞ = ∅ ∈ F, that is, the degenerate spanning forest made of n trees reduced to simple roots, and eventually reaches in finite time a unique spanning rooted tree.
As a corollary of the above coupling theorem, we get a determinantal characterization of the finite-dimensional distributions of the process t → R(Φ 1/t ), which can be seen as a dynamical extension of Theorem 3.
Corollary 6 (Dynamic roots distribution). For any choice 0 < t 1 < · · · < t k < t k+1 = 1/q k+1 and any sequence A 1 , . . . , A k , A k+1 of subsets of V, it holds This statement corresponds to Prop. 2.4 in [4]. We do not have a similar characterization for the partition process t → P(Φ 1/t ). More generally, as we saw, while a precise understanding and characterization of Φ q and R(Φ q ) is possible, we know very little about the induced partition P(Φ q ).
We conclude this part on the relevant theoretical results by mentioning a last property of the root set, namely, by conditioning on the induced partition P(Φ q ) the roots are distributed according to the equilibrium measure µ of the random walk X restricted to each component of the partition: Theorem 7 (Roots at restricted equilibria). Let [A 1 , . . . , A m ] be denoting an arbitrary partition of V in m ≤ n subsets and fix r i ∈ A i , i = 1, . . . , m. Then provided that the conditioning has non-zero probability, with µ A i denoting the invariant measure of the restricted dynamics with generator L i defined by This statement is a consequence of the well-known Markov chain tree theorem (cf. e.g. [1]), see Prop. 2.3 in [4].

Sampling algorithms: Wilson's & Co.
The flourishing literature around the random spanning tree theme is mainly due to Wilson's algorithm (cf. [42]), which is not only a practical procedure to sample Φ q,B , but actually also a powerful tool to analyze its law. The reader not acquainted with this topic is invited to look into the proofs of the results presented in section 2 which heavily exploit the power of this algorithm in action. We will start by recalling it, section 3.1. We then explain how to get an approximate sample of a forest with a prescribed number of roots, section 3.2, and we conclude this sampling algorithmic part with some comments about sampling the forest dynamics in Theorem 5, section 3.3.

3.1.
Wilson's algorithm: sampling a forest for fixed q. The following algorithm due to Wilson [42] samples Φ q,B for q > 0 or B = ∅: a. start from B 0 = B and φ 0 = ∅, choose x in V \ B 0 and set i = 0; b. run the Markov process starting at x up to time T q ∧T B i with T q an independent exponential random variable with parameter q (so that T q = +∞ if q = 0) and T B i the hitting time of It is worth stressing that in steps a. and d. the choice of the starting points x is arbitrary, a remarkable fact which represents the main strength of this algorithm.
There are at least two ways to prove that this algorithm indeed samples Φ q,B with the desired law. One option is to follow Wilson's original proof in [42], which makes use of the so-called Diaconis-Fulton stack representation of Markov chains, cf. [13]. An alternative option is to follow Marchal who first computes in [30] the law of the loop erased trajectory Γ x q,B obtained from the random trajectory X : [0, T q ∧ T B ] → V started at x ∈ V \ B and stopped in B or at an exponential time T q if T q is smaller than the hitting time T B . One has indeed: From this result one can easily compute the law of Φ q,B following the steps of the algorithm above to get the law in Equation (11). Further, from this statement we see how determinants of the Laplacian emerge. Concerning the average running time of Wilson's algorithm, it is in general polynomial in the number of nodes n and typically much smaller than the random walk cover time. In particular, it can be explicitly characterized in spectral terms as the sum of the inverse of the n − |B| eigenvalues of [−L] V\B , see e.g. Prop. 1 in [30].

3.2.
Forests with a prescribed number of roots: approximate sampling. We have seen that Wilson's algorithm provides a practical way to sample Φ q . In applications, one might be interested in sampling Φ q conditioned on having a prescribed number of roots, that is, conditioned on |R(Φ q )| = m for fixed m < n. Unfortunately, we do not know any efficient algorithm providing such an outcome. Nevertheless we can exploit Theorem 4 to get a procedure to sample Φ q with approximately m roots, with an error of order √ m at most. In fact, it is not difficult to check that Var |R(Φ q )| ≤ 2E |R(Φ q )| and, in view of (18), it suffices to find the solution q * of the equation However, solving Equation (20) requires to compute the eigenvalues λ j 's of −L which is in general computationally costly especially if we are dealing with a big size network. One way to find an approximate value of the solution q * is to use, on the one hand, the fact that q * is the only one stable attractor of the recursive sequence defined by and on the other hand, the fact that |R(Φ q )| and E |R(Φ q )| are typically of the same order, at We refer the reader to section 2.2 in [4] to argue that indeed this algorithm rapidly stops.
3.3. Coalescence-fragmentation process: sampling for different q's at once. The Markov process F in Theorem 5 is based on the construction of a coalescence-fragmentation process with values in F making use of Diaconis-Fulton's stack representation of random walks. For a detailed account on this algorithm and a number of related open questions, we refer the reader to section 2.3 in [4].
We mention that this algorithm allows to couple forests for different values of q's. The corresponding coupling is not monotone, in the sense that if q < q, it is not true that |R(Φ q )| ≤ |R(Φ q )| a.s. under the coupling measure, despite the fact that E |R(Φ q )| < E |R(Φ q )| , see e.g. Equation (18). Yet this coupling is a very valuable tool in applications. In fact, it allows to practically sample Φ q starting from a sampled Φ q for any q < q, and more generally, by running this algorithm once in a chosen interval [0, t * ], we get samples of the whole forest trajectory (Φ 1/t : t ≤ t * ).

Applications: Well distributed points in a network
Given a map of a city modeled as a network G where road crossings are identified with vertices, assume that we are interested in locating a number of energy plants at some crossings in such a way that the energy flowing out of these plants takes on average the same amount of time to reach every vertex of the city. If the energy flows according to a random walk X, we can rephrase the above problem by finding a subset B ⊂ V for the locations of the energy plants such that for any x ∈ V: In other words, B ⊂ V would constitute a set of well distributed points in the network . We immediately realize that unless we have as many energy plants as the number of crossings, B = V, there is no deterministic proper subset R satisfying the property in (21). Hence this notion of well distributed points is in principle meaningless but, by thinking in terms of disorder , we can turn it into a well-posed definition by finding a random set satisfying (21) in an averaged sense. That is, a random set B(ω) ⊂ V is chosen according to some law with expectation E so that It then raises the question: does it exist such a random subset? Our set of roots R(Φ q ) might be a good candidate since as previously noticed, the determinantality of the roots set stated in Theorem 3 implies negative correlations suggesting that the roots in Φ q tend to spread far apart from each other irrespective of the given network structure. It turns out that indeed R(Φ q ) gives a positive answer to this question for any q. Further, as the next statement shows, the same is true when conditioning on having exactly m energy plants, that is, with random subsets of prescribed size: where a n+1 = 0, and a k denotes the coefficient of order k of the characteristic polynomial of L.
This statement corresponds to Thm. 1 in [4]. When conditioning in Equation (23) with either m = 1 or m = n − 1, it turns out that the property in (22) actually characterizes the law of B(ω). In particular, for m = 1, the Markov tree-theorem, see e.g. [1], ensures that conditionnally on {|R(Φ q )| = 1}, R(Φ q ) coincides with a point sampled according to the equilibrium measure µ of X. In this case, Equation (23) was already known in the literature and often referred to as the random target lemma, cf. e.g. Lemma 10.8 in [27]. Our theorem is therefore a natural extension of this random target to subsets of arbitrary sizes. To get some insights in the corresponding proof, it is worth mentioning that the expected hitting time of a given deterministic set B in Equation (21) admits the following characterization due to Freidlin and Wentzell in terms of forests, cf. Lemma 3.1 in [4] and Lemma 3.3 in [18]: with G B denoting the Green's function in Equation (7) stopped at the hitting time T = T B , and τ x (φ) standing for the tree in φ ∈ F containing x ∈ V.
Concrete applications of this result will be given in the next sections to build up suitable subsampling procedures in the context of networks reduction and signal processing.

Applications: Network coarse graining via intertwining
The goal of this section is to propose a random network coarse graining procedure which exploits the rich and flexible structure of the random spanning forest. The problem can be formulated as follows: (P1) Given a network G on n nodes, find a smaller networkḠ on m < n nodes "mimicking" the original network G.
For the sake of simplicity, we will restrict ourselves to networks in a reversible setup, that is, when µ(x)w(x, y) = µ(y)w(y, x) for any x, y ∈ V, with µ being the invariant measure of the Markov process X. Now, it is a priori not clear what "mimicking" means and any meaningful answer would strongly depend on the implemented method and on the specific applications one has in mind. Our approach will be process-driven. Namely, we saw that the structure of the starting network G is encoded in the weighted graph Laplacian in Equation (2) which characterizes the Markov process X with state space V. In view of the one-to-one correspondence between G and the process X, we can look for aḠ in correspondence to another processX on a state spaceV with m = |V| < |V| = n points being some sort of coarse grained version of the process X. In this way, we are shifting perspective from graphs to Markov processes and, within this context, there is an interesting duality notion called Markov chain intertwining which will be our lighthouse while addressing (P1). Before discussing this duality, we anticipate that concerning the applications, our main motivation stems from two different problems: • metastability studies outside any asymptotic regimes (like low temperature or large volume limits), • signal processing on arbitrary networks. We will hence propose in sections 5.2.2 and 5.3 two different answers to (P1), namely, two network reduction procedures of general interest based on intertwining and random forests, inspired by and well-suited for the above problems. Our reduction schemes might also be useful for other applications, hence we will start by discussing where the difficulties lie when looking at network reduction problems through intertwining equations. We start by introducing and discussing the intertwining duality. 5.1. Intertwining and squeezing. Given two Markov chains with transition matrices P and P , and state spaces V andV, and given a rectangular stochastic matrix Λ :V × V → [0, 1], we say that the two chains are intertwined w.r.t. Λ if : Denoting by {νx = Λ(x, ·) :x ∈V} the family of probability measures on V identified by Λ, we see that this algebraic relation among matrices can be rewritten as which says that: the one-step-evolution of the νx's according to P remains in their convex hull. This duality notion can be equivalently formulated for continuous-time Markov processes by saying that two Markov processes X andX with generators L andL and state spaces V andV are intertwined w.
By associating to the Markov process X with generator L in (2) the discrete-time skeleton chainX as in (5), we see that (26) is equivalent to (24) ifP =L wmax + IdV , and Equation (25) reads as : which says again that, for eachx ∈V, by evolving the distribution νx according to L, the process under consideration, with rateL(x,ȳ) is distributed according to νȳ.

5.1.1.
Intertwining in the literature. Intertwining relations appeared in the context of diffusion processes in a paper by Rogers and Pitman [36] as a tool to state identities in laws when the measures νx's have disjoint supports. This method was later successfully applied to many other examples (see for instance [9], [31]). In the context of Markov chains, intertwining was used by Diaconis and Fill [12] without the disjoint support restriction to build strong stationary times and to control convergence rates to equilibrium. At the time being, applications of intertwining include random matrices [14], particle systems [41]...

5.1.2.
Solutions to intertwining equations, overlap and Heisenberg principle. In the above references, intertwining relations have often been considered with |V| being (much) larger than or equal to |V|. To address (P1), we will instead be naturally interested in the complementary case |V| < |V| and with the coarse grained process (or network identified by)P being irreducible. In this setup it is not difficult to show the existence of solutions (Λ,P ) to Equation (24), how they are related to the spectrum of P , and how to construct some of them. We refer the interested reader to section 2.2 in [2] and the statements therein. Let us simply note here • that the intertwining equations generally have many solutions, including the trivial ones when, for anyP , all the νx are equal to the invariant measure µ of P , • that the stability of the convex hull of the νx implies the stability of the vector space they span and the fact that the νx have to be linear combinations of at most |V| left eigenvectors of P . Looking at the eigenvectors of the generator L as the analogue of the usual Fourier basis, which diagonalizes the Laplacian operator, we will say that the solutions νx of the intertwining equations have to be frequency localized. But we will see that the solutions we will be interested in for building a coarse-grained network should also be "space localized" or "little overlapping". We will soon make precise what is meant here. Let us simply stress for now that having both frequency and space localization would contradict a (unfortunately not well established for arbitrary graphs) Heisenberg principle. To overcome this difficulty we will look at approximate solutions of intertwining equations and we will focus on their squeezing, which is a measure of their joint overlap or space localization, that we can now introduce. 5.1.3. The squeezing functional. On the space of probability measures on V with |V| = n, let us denote by ·, · the scalar product defined as for arbitrary probability measures ν 1 , ν 2 on V, and let · be the corresponding norm. Given a family {νx = Λ(x, ·) :x ∈V}, since these measures form acute angles between them ( νx, νȳ ≥ 0 for allx andȳ inV) and have disjoint supports if and only if they are orthogonal, one could use the volume of the parallelepiped they form to measure their "joint overlap". The square of this volume is given by the determinant of the Gram matrix: with Γ the square matrix onV with entries Γ(x,ȳ) = νx, νȳ , that is where D(1/µ) is the diagonal matrix with entries given by (1/µ(x), x ∈ V), and Λ t is the transpose of Λ. Loosely speaking, the less overlap, the largest the volume. We will instead use the squeezing of Λ, that we defined by to measure this "joint overlap". We call it "squeezing" not only because the νx and the parallelepiped they form is squeezed when S(Λ) is large, but also because S(Λ) is half the diameter of the rectangular parallelepiped that circumscribes the ellipsoid defined by the Gram matrix Γ : this ellipsoid is squeezed too when S(Λ) is large. We note finally that our squeezing controls the volume of Λ. Indeed, by the comparison between harmonic and geometric mean applied to the eigenvalue of the Gram matrix, small squeezing implies large volume: Vol(Λ) 1/n S(Λ) ≥ √ n. The next statement, corresponding to Prop. 1 in [2], gives bounds on this squeezing functional suitable for our approach.
Proposition 10 (Bounds on the squeezing functional). Let {νx :x ∈V} be a collection of m = |V| probability measures on V.
• We have Equality holds if and only if the νx's are pairwise orthogonal.
• Assume that µ is a convex combination of the {νx :x ∈V}. Then, Equality holds if and only if the νx's are pairwise orthogonal.
We notice in particular that S(Λ) is maximal when the measures {νx :x ∈V} are linearly dependent, and minimal when they are orthogonal. Moreover, we know the minimal value of S(Λ), when µ is a convex combination of the νx's. Note that this is necessarily the case if the convex hull of the νx is stable under P , i.e. when (24) holds for some stochasticP . Indeed it is then stable under e tL for any t > 0 and the rows of Λe tL converge to µ when t goes to infinity. We are now in shape to move to the applications.

5.2.
Intertwining and metastability without asymptotics. A classical problem in metastability studies can be described as follows. Associated with Markovian models one is interested in making a coarse grained picture of a dynamics which evolves on a large, possibly very large, configuration space, according to some generator L. A metastable state can be thought as a stationary distribution on this large configuration space up to some random exponential time that triggers a transition to a different metastable state, possibly a more stable one. By different metastable states we mean "concentrated on different parts of the large configuration space." This is usually addressed in some asymptotic regime such as low temperature or large volume limits. A natural and mathematically rigorous way (cf. Theorem 11 below) to perform such a coarse graining, and avoiding limiting procedures, would be to provide solutions (Λ,L) to (26) Figure 4) we consider a Metropolis-Glauber kinetic Ising model for a spin system started from aligned minus spins (yellow pixels on the pictures, red pixels standing for plus spins) and evolving under a small magnetic field h = 0.14, at subcritical temperature T = 1.5, in a n × n rectangular box B n with periodic boundary conditions and n = 256 (thus here V = {+, −} B 256 ). The first three pictures can be thought as samples of a metastable state, which is concentrated on the minus phase of the system, that is stationary up to the appearance (nucleation) of a supercritical droplet. This droplet triggers the relaxation to a stable state, which is concentrated on the plus phase, and samples of which are given by the last three pictures. In this case,L in (26) would be described by a 2 × 2 matrix. Since the nucleation time is "long" (in the simulation the time needed for the supercritical droplet to invade the whole box is short with respect to the time needed for its appearance), solving (26) with little overlapping νx would lead to a very smallL. A natural alternative approach would be to provide, with the same kind of link Λ, an intertwining relation between Markovian kernels (rather than generators) of the form: with K q (x, ·) as in (16). The measure K q (x, ·) is indeed the distribution of the original process, with generator L, started at a configuration x ∈ V and looked at an exponential time T q with parameter q . This parameter should be of the same order as the nucleation rate.
Unfortunately for this specific Ising example, we do not know how to write down solutions of such intertwining equations with Λ having small squeezing. However, while the main results in metastability studies are usually written in some asymptotic regime (e.g. for the Glauber dynamics illustrated in Figure 4: h 1 with an associated diverging box side length), we stress that looking at metastable dynamics through intertwining equations makes possible to deal with such dynamics outside of any asymptotic regime. This is made clear through the following statement which we state in discrete-time since (31) is as in (24) with P = K q . • νx is stationary up to time Tx, i.e., for all t ≥ 0, • P νx Ȳx =ȳ =P (x,ȳ) 1−P (x,x) for allȳ inV; • P νx X (Tx) = · Ȳx =ȳ = νȳ(·); • Ȳx ,X(Tx) and Tx are independent.
This statement corresponds to Prop. 6 in [2] and it can be seen as a partial rewriting of section 2.4 of [12] in the spirit of [33].

5.2.2.
A coarse-graining algorithm for metastability: from processes to measures. To make use in practice of the result in Theorem 11, as motivated so far, one should find explicit solutions (Λ,P ) to (31) with corresponding νx's having small joint overlap, which for relevant non-trivial examples could be too difficult if not unfeasible. We introduce here a deterministic algorithm depending on some tuning parameters to circumvent this problem. For a given Markov process X with a big state space (or equivalently the given associated network G), the goal is to build a measurevalued process on a small state space "mimicking" dynamical aspects of the original process in the sense of Theorem 11. We will accomplish our task if the generator of the resulting measure-valued process is close to be intertwined with the original one, and if the associated Λ has small squeezing. To guarantee these properties, we will then randomize the procedure through the random forests and give appropriate estimates for the tuning of the involved parameters.
Deterministic algorithm based on partitioning with µ A being the probability µ conditioned to A ⊂ V, i.e. µ A = µ(·|A); d. The new process is given bȳ for anyx,ȳ ∈V and with T q being an independent exponential random variable of parameter q > 0. A few remarks: • Proposition 10 guarantees that the linking measures in (33) have minimal squeezing S(Λ) equal to one. • Provided that this pair (Λ,P q ) is close to be a solution to the intertwining Equation (31), the networkḠ =Ḡ(q ) identified byP q can be seen as a reduced measure-valued description of the original network G on time-scale T q , hence a possible answer to our original problem (P1). In particular, by construction, inherited from X,P q is again irreducible and reversible w.r.t. µ. • For any choice of the parameter q > 0, in view of step d. and the irreducibility assumption, the resulting networkḠ will be given by a complete graph with non-homogenous weights identified by (34). It remains to clarify how to choose the initial partition P(G) in step a. above and how to guarantee that (Λ,P q ) is an approximate solution to the intertwining Equation (31). To this end, we can exploit the nice properties of the random forests in (1) and randomize the deterministic algorithm above.
Randomization through forests: for q > 0. The choice of the random partition P(Φ q ) is motivated by Theorems 7 and 9 and the fact that by Wilson's algorithms, two nodes x, y ∈ V tend to be in the same block of P(Φ q ) if on scale T q , the process X is likely to walk from x to y or vice-versa. The following theorem quantifies in which sense the pair (Λ,P q ) is an approximate solution to (31) and can serve as a guideline to tune the involved parameters (q, q ). Since for eachx ∈V, ΛK q (x, ·) andP q Λ(x, ·) are probability measures on V, we will use the total variation distance defined in (10).
Theorem 12 (Control on intertwining error for metastability). Let p ≥ 1, and p * its conjugate exponent, i.e., 1 p + 1 p * = 1. Fix positive parameters (q, q ), and let (Λ,P q ) be the pair given by (33) and (34) randomized through Φ q as in (35). Then, where |Γ x q | in the r.h.s. denotes the length (i.e. the number of crossed edges) of the trajectory of a loop-erased random walk on the original graph started from x ∈ V and stopped at an exponential time T q (recall Proposition 8).
The proof of the above statement together with some insights on how to tune (q, q ) to guarantee the bound in the r.h.s. to be small, can be found in [2], cf. Thm. 4 and related discussion therein. For the interested reader, Thm. 5 in [2] also says how to modify the intertwining matrix Λ in (33), and make it into an exact solution to (31) for small enough q . Figure 5 illustrates the coupled partitions associated with different values of q for a random walk in random potential. We consider a Metropolis nearest-neighbour random walk associated with Brownian sheet potential V and inverse temperature β on the square box [0, 511] 2 ∩ Z 2 . This means that the rates w(x, y) are given by w(x, y) = exp −β[V (y) − V (x)] + if x and y are nearest neighbours, and 0 if not. In this picture, the vertical and horizontal axes are oriented southward and eastward respectively, so that V is 0 on the left and top boundaries. Since the value of V at the bottom-right corner is of order 500 (it is the sum of 510 2 independent normal random variables), we already have a metastable situation for the case β = 0.16, illustrated in Figure 5: on the time scales 1/q corresponding to the different partitions P(Φ q ), the random walk tends to be trapped in each piece of the partition.
Let us finally observe that if such a coarse-graining scheme can be practically implemented on very large networks, the typical size of statistical mechanics configuration spaces is too large for these algorithms to be run in such cases. For the Ising example in section 5.2.1, we would need to deal with a network made of |V| = 2 65536 nodes. But it could be that a similar coarse-graining procedure applied on the graph on which the spin are defined (the 256 × 256 two-dimensional torus in our case), rather than on the configuration graph (with its 2 65536 nodes in our case), can be used to derive a coarse description of the original dynamics. We believe that such an approach would deserve some investigation. In section 5.3 we give another similar algorithm to address these slightly different questions. This other algorithm will be concretely motivated by applications in signal processing rather than in metastability. After describing it, we will turn back to our random walk on the Brownian sheet. 5.3. Network coarse-graining from a signal processing point of view. We present here a different network reduction scheme. Our original motivation will be fully explained in the next section 6 where we construct a pyramidal algorithm to process signals (i.e. real valued functions) defined on the vertex set of a network. In first approximation, the idea of these pyramidal algorithms is to build from a given network and associated signal, a multiscale invertible reduction scheme where at each scale, after so-called downsampling and filtering procedures, one can define a network of smaller size (typically having a fraction of the original nodes) with associated a coarse grained approximation of the signal. We again have to address problem (P1) from the beginning of this section, but the purpose is different. We will follow a similar strategy as for metastability, but looking at solutions (Λ,L) to Equation (26) under a different perspective. In this case, as intertwining rectangular matrix of size |V| × |V|, we use that is, the restriction of the kernel in (16). Actually, based on the concrete problem under investigation, other suitable kernels can also be used, this is just a convenient choice for our application explained in more details in section 6. We only anticipate that as for metastability, the measures identified by the Λ's rows should concentrate on different regions of the underlying state space. They will play the role of low frequency filters in the wavelets construction to build coarse-grained versions of the original signal through local averages. Anyhow, in this section, we focus only on the network reduction step, that is, on theL we want to consider, and how we can guarantee that it is almost intertwined with the given L. • The Schur complement is a classical electrical-network-like reduction, sometimes referred to as Kron's reduction, having nice properties and a clear probabilistic interpretation which we explain in Proposition 13 below. In particular, as for the algorithm in section 5.2.2, the resulting process is still irreducible and reversible, thus allowing for successive iterations of this coarse graining procedure (see example in Figure 6). • The resulting networkḠ tends to be (depending on local bottlenecks and the locations of the selected vertices in the original graph) a complete graph with non-homogenous weights, cf. Equation (38). Depending on the specific problem, one would wish to obtain a sparser graph (e.g. when iterating the scheme, dealing with sparse matrices can reduce the algorithmic complexity). To this aim, a possibility is to redistribute part of the masses in (38) and disconnect edges with low weights. An example of such a "sparsification" procedure will be briefly mentioned in section 5.3.2 below.
• Unlike the algorithm presented in section 5.2.2, for any q > 0, the measures identified by Λ q do not have disjoint supports. Thus quantitative bounds on their squeezing are desirable. • As in the case of metastability, the pair (Λ q ,L) identified by (36) and (37) is not a solution to (26), but can be turned into an approximate solution. In the following proposition, we collect relevant properties of the Schur complement and its probabilistic interpretation. Its proof can be found in [3], see Lemma 13 in there.
Proposition 13 (Schur complement and trace process). LetV be any subset of V and L as in (2) reversible w.r.t. µ. SetV = V \V. Then the restricted matrix [L]V is invertible, and the Schur complementL of [L]V in L is an irreducible Markov generator onV reversible w.r.t. µ. This Markov process with generatorL is often referred to as trace process on the setV. Further, the discrete-time kernel given byP = Id +L/w max admits the following interpretation: with T + V being the first return time of X inV. To sum up, we link anyx,ȳ ∈V with a weight (possibly zero) proportional to (38), that is, the probability that the original walk X starting fromx lands inȳ when hitting again the subsetV ⊂ V. We can next move to control squeezing and intertwining error for this choice of (Λ q ,L). In view of the well distributed property of the forests roots in Theorem 7, a natural way to select the reduced vertex set in step a. is to consider which, as for metastability, makes the pair (Λ q ,L) randomized w.r.t. Φ q . We next move to control squeezing and intertwining error in this case for which we use the p -norm in (9): Theorem 14 (Squeezing vs Intertwining). Fix q > 0. ConsiderV ⊂ V, and setV = V \V.
Then the deterministic pair (Λ q ,L) given by (36) and (37) satisfies: for any p ≥ 1, p * being its conjugate exponent, and any f in p (V, µ), where and P as in (5). (41) Further, by randomizing (Λ q ,L) as in Equation (39) for some q > 0, we get the squeezing control: and p j (·) as in (17) This result corresponds to part of Theorem 3 in [2] and Proposition 16 in [3]. Note that the upper bound on the squeezing depends on L through its spectrum only. For general remarks and insights on how to tune (q, q ) based on this result, we refer the reader to the comment after Theorem 3 in [2] and to section 6 in [3]. In particular, section 6.2 gives estimates on β defined in (41). We conclude this part on coarse-graining algorithms by giving a concrete example on a classical road network benchmark.

5.3.2.
An experiment: reduction of Minnesota road network. Figures 6 and 7 illustrate our recursive coarse-graining algorithm with and without sparsification procedure. We start from the Minnesota road network with unitary weights and choose our downsampling parameter as explained in section 6. Figure 6. Two successive reductions of the Minnesota's graph using the algorithm in section 5.3. The wider the edge, the larger its weight. Figure 6 shows that, as mentioned previously, our original sparse graph is rapidly changed into a dense one after a few Kron's reductions. The intertwining error, i.e., the approximation error in the intertwining equation, is then our guideline to build a sparsification procedure. We set down to 0 a family of small weightsw(x,ȳ) obtained by computing the Schur complementL, to get a sparser L s . These are chosen in such a way that for eachx the new error νx −L s Λ(x, ·) ∞ does not exceed the original error νx −LΛ(x, ·) ∞ by more than a fraction of it. We refer to section 7.2 in [3] for the details. Figure 7 illustrates the result.

5.3.3.
Back to the random walk in random potential in Figure 5. The very nature of metastable dynamics makes impossible in practice to run Wilson's algorithm to get a partition P(Φ q ) with a few pieces only, even for relatively small networks. The parameter q should be so small that we would get a huge running time of the algorithm. But the recursive procedure we described, with at each step a downsampling parameter q of the same order as the current maximal jumping rate w max , makes it possible to describe the long time dynamics through the computation of the trace process on a very small set of points. Figure 8 shows some of these reduced graphs for the random walk in Brownian potential on a smaller grid (64×64) but with larger inverse temperature β = 2.56, so that we face the same kind of difficulty. This reduction allows us to describe the dynamics up to convergence to equilibrium, which occurs on a time scale of order 10 33 . This is similar in spirit to the renormalization procedure introduced in [37]. The main difference here is that each vertex of the reduced graphs is associated through the linking matrices with a probability measure on the original graph. Our reduced process is a measure-valued process rather than a Markov process on some local minima of an energy landscape.

Applications: Intertwining Wavelets, multiresolution for graph signals
Weighted graphs provide a flexible representation of geometric structures of irregular domains, and they are now a commonly used tool to encode and analyze data sets in numerous disciplines including neurosciences, social sciences, biology, transport, communications... Edges between vertices represents interactions between them, and weights on edges quantify the strength of the interaction. In data modeling, it is often the case that a graph signal comes along with the network structure G, this is simply a real-valued function on the vertex set of the network: Functional magnetic resonance images measuring brain activity in distinct functional regions are classical example of such a graph signal. Signal processing is the discipline devoted to develop tools and theory to process and analyze signals. Depending on concrete instances, processing means e.g. classifying, removing noise, compressing or visualizing. In case of signals on regular domains very robust tools and algorithms have been developed over more than half a century mainly based on Fourier analysis. For signals on arbitrary networks studies are less advanced and only in recent years significant efforts from different communities have been dedicated to develop suitable efficient methods. We refer to [39] for a recent review on this growing investigation line. We present here a new and rather general multiresolution scheme we introduced in [3]. Multiresolution scheme is a generic name for several multiscale algorithms allowing to decompose and process signals. We will start with a quick recap of classical multiresolution schemes on regular domains. In section 6.2 we then present our new forest-algorithmic-scheme and in section 6.3 we collect the main theorems providing a solid theoretical framework to our method and guidelines for the practice. Illustrative numerical experiments of various nature are given in the concluding section 6.5.
6.1. Classical multiresolution: wavelets and pyramidal algorithms on regular grids. Let us consider a discrete periodic function f : Z n = Z/nZ → R, viewed as a vector in R n . The multiresolution analysis of f is based on wavelet analysis which roughly amounts to compute an "approximation"f ∈ R n/2 and a "detail" componentf ∈ R n/2 through classical operations in signal processing such as "filtering" and "downsampling". The idea is that the approximation gives the main trends present in f whereas the detail contains more refined information. This is done by splitting the frequency content of f into two components: the approximationf focuses on the low frequency part of f whereas the high frequencies in f are contained inf .
• Filtering is the operation allowing to perform such frequency splittings and it consists of computing a convolution f k for some well chosen kernel k: K l (f ) = f k l yields K l (f ) where • x belongs to the set of downsamples Z n isomorphic to Z/ n 2 Z. • {ϕ x , x ∈ Z n } is the set of functions such that the equality between linear forms < ϕ x , . >= K l (·)(x) holds for all x ∈ Z n . • In the same way, {ψ x , x ∈ Z n } is such that < ψ x , . >= K h (·)(x) holds for all x ∈ Z n . The choice of k l and k h is clearly crucial and done in a way that perfect reconstruction of f from f andf is possible with no loss of information in the representation [f ,f ]. By denoting f 0 = f , f 1 =f and g 1 =f , we see that this splitting scheme can be successively iterated starting from f 1 to obtained a sequence f N ∈ R n/2 N , g N ∈ R n/2 N , . . . , g 1 ∈ R n/2 , for any integer N where the total length of the concatenated vectors [f N , g N , . . . , g 1 ] is exactly n. This leads to the multiresolution scheme: We remark that the perfect reconstruction condition amounts to have {ϕ x , ψ x , x ∈ Z n } a basis for the signals f on Z n . A famous construction by Ingrid Daubechies [11] derives several families of orthonormal compactly supported such basis. It is worth mentioning that these families combine localization in space around the point x and localization properties in frequency due to the filtering step they have been built from. Using this space-frequency localization one can derive key properties of the wavelet analysis of a signal which rely on the deep links between the local regularity properties of f and the behavior and decay properties of detail coefficients. We refer the interested reader to one of the numerous books on wavelet methods and their applications such as [11] or [29], in all these methods the word wavelets denotes the family {ψ x , x ∈ Z n } spanning for the high-frequencies components.
6.2. Forest-multiresolution-scheme on arbitrary networks. When considering signals f on irregular networks G, it is not clear how to reproduce the classical multiresolution scheme described above. In other words, in a non-regular network, there are no canonical neither obvious answers to the following questions: (Q1) What kind of downsampling should one use? What is the meaning of "every other point"? (Q2) On which weighted graph should the approximation f 1 be defined to iterate the procedure? (Q3) Which kind of filters should one use? What is a good notion of low frequency components of a signal? In light of the properties and applications of the random spanning forests described in the previous sections, we do have natural answers to the first two questions. In fact, Theorem 9 suggests that the set of roots is a random downsample tailored to any network providing a possible answer to (Q1) 1 . And the network coarse-graining algorithm presented in section 5.3 is a good candidate to address (Q2). It remains to make sense of filtering in (Q3), that is, how to capture the low frequencies components or the main trend of a graph signal f . But if we use the algorithm in section 5.3, then it is natural to define the approximation componentf (x) as a local average around the downsampled x ∈V w.r.t. the measures identified by the intertwining matrix Λ = Λ q in (36). That is, for each downsampledx ∈V:f Before proceeding with our construction, let us give some remarks on one of the main problems in defining good filters in signal processing.
6.2.1. Filtering : fighting against Heisenberg. To be of any practical use in signal processing, the filters {νx,x ∈V} have to be well localized both in space and frequency. This is violating Heisenberg uncertainty principle, a delicate problem which needs a proper compromise. In the graph context, frequency localization means that the filters belong to an eigenspace of the graph laplacian L. Hence, we are interested in solutions to Equation (26) such that the measures {νx,x ∈V} (our proposed filters) are linearly independent measures tending to be non-overlapping (space localization), and contained in eigenspaces of L. We already observed that saying that (Λ,L) is an exact solution to (26) implies that the linear space spanned by the measures {νx,x ∈V} is stable by L, and is therefore a direct sum of eigenspaces of L, so that these measures provide filters which are frequency localized. Hence the error in the intertwining relation is a measure of frequency localization: the smaller the intertwining error, the better the frequency localization. And through Theorem 14 we can control such frequency localization.
Concerning space localization, we then want small squeezing on Λ. Notice further that our Λ in (36) is just the restriction of the Green's kernel K q which is very sensitive to localization by tuning the parameter q . In fact: • when q goes to 0, for anyx ∈V, K q (x, y) goes to µ(y) so that (26) is trivially satisfied.
Since µ is the left-eigenvector of L corresponding to the eigenvalue 0, the K q (x, y) are well localized in frequency. However, the vectors {K q (x, ·) |x ∈V} become linearly dependent and very badly localized in space. • On the other extreme, when q goes to ∞, K q (x, ·) goes to δx. Hence, the space localization is perfect. However, the frequency localization is lost, and the error in (26) tends to grow. A compromise has to be made for the choice of q . Since in our method there is also the parameter q controlling the fraction of downsampled points, we will need a suitable joint tuning of the pair (q, q ) to optimize localization properties. As explained in section 6.4, our tuning choice is indeed guided by Theorem 14 but also on stability results of the proposed method which we state in section 6.3 below. For a detail discussion on the actual choice of the parameters, we refer to section 6 in [3]. 6.2.2. Approximation, detail and the full forest-multiresolution. Let us summarize our proposed forest-multiresolution-scheme and present the corresponding basis construction. For arbitrary realfunctions f, g on V, we will denote by the scalar product w.r.t. µ. Intertwining Wavelets Given an irreducible and reversible network G = (V, w) on n vertices and a signal f : V → R.
a. Forest-downsampling: Choose a fix q > 0, letḠ =Ḡ(q) = (V,w) be the randomized coarse-grained (irreducible and reversible) network given by the algorithm in section 5.3.1 withV =V(q) as in (39), and setV = V \V. b. Forest-filtering: Fix q > 0 and let Λ = Λ q be as in (36). Define the approximation component of f as the functionf defined onV bȳ and the detail component of f as the functionf defined onV by Theorem 15 (Basis and wavelets). Fix a parameter q > 0. For eachx ∈V ⊂ V andx ∈V, respectively, define on V the densities functions of the measures Λ(x, ·) w.r.t. µ: and the functions and abbreviate {ξ x } x∈V = {(φx, ψx) |x ∈V,x ∈V}. Then the family {ξ x } x∈V is a basis of 2 (V, µ).
The statement above corresponds to Lemma 9 in [3]. As in classical multiresolution analysis, the functions {ψx,x ∈V} represent our wavelets. The basis functions given by {ξ x } x∈V are not pairwise orthogonal w.r.t. ·, · µ but, by considering the corresponding dual basis in 2 (V, µ), that is, the family {ξ x } x∈V defined through ξ x , ξ y µ = δ xy , x, y ∈ V, for any f ∈ 2 (V, µ), we get the following representation identifying our "split into low and high frequency" components. We call analysis operator U = U q the operator assigning to f ∈ 2 (V, µ) its coefficients in the expansion in (48). As explained in the following theorem, the reconstruction of f from the knowledge of its coefficients U (f ) can be made operative: Theorem 16 (Reconstruction formula). Fix q > 0. For any f ∈ 2 (V, µ), considerf ∈ 2 (V, µ) andf ∈ 2 (V, µ) respectively given bȳ This last statement correspond to Prop. 10 in [3] and fully describes our multiresolution scheme. In fact, in view of the properties ofḠ, we can simply iterate the procedure with (V,L, µV ) in place of (V, L, µ) resulting into a pyramidal algorithm as described in Equation (42).
Still, we have not enough motivated the choice of the filter bank given in (44) and (45) for our "high and low frequencies" components in relation to regularity properties of signals. To clarify this point, let us first emphasize that in any reasonable wavelets construction, as mentioned in section 6.2.1, good space and frequency localization properties are necessary ingredients. And in our approach, this issue is partially addressed via Theorem 14. Notice in particular that space localization is achieved using the determinantality ofV, and the fact thatV is well spread on V. SinceV = V \V is also a determinantal process, with kernel Id − K q , this suggested the detail definition (45) (the sign convention for the ψx is chosen to have a self-adjoint analysis operator µ)). On the other hand, another fundamental ingredient is that if the signal f is "regular enough", then the corresponding detail componentf should be small. For instance, if the original function is constant it should not contain any high frequency component, that is, the correspondingf being identically zero. As the last remark in Theorem 15 states, this is in fact the case in our intertwining wavelets. However, more generally, a way to capture and guarantee that the size of the details is small for non-constant but regular functions is desirable. In the next section we give bounds on the involved operators as a function of q and we make sense of this latter regularity issue through the norm of Lf in Theorems 19 and 21.
6.3. Quality and stability of intertwining wavelets. We collect here our results to guarantee numerical stability when using the forest-multiresolution-scheme and to guide in the choice of the parameters. We stress that the scheme presented in the previous section can be implemented when the downsampleV ⊂ V is chosen according to any (even deterministic) rule. For this reason, the following statements controlling norms and sizes of the main involved objects are stated for arbitraryV ⊂ V and will not depend on q. We start by the control on the approximation and detail operators in (50) and (50), respectively.
Theorem 17 (Bound on the norm of the approximation operator). Fix q > 0. LetV be any subset of V,V = V \V, and letR q be the operator defined in (50). For any p ≥ 1, for any f ∈ p (V, µV ), with w max as in (4),w max defined analogously w.r.t. the generatorL, and β as in (41).
Theorem 18 (Bound on the norm of the detail operator). Fix q > 0. LetV be any subset of V,V = V \V, and letȒ q be the operator defined in (51). For all p ≥ 1, for all f ∈ p (V, µV ), where p * is the conjugate exponent of p, β as in (41), and We refer to section 6.2 in [3] for estimates on γ, β andw max defined in (54). We now move to the regularity issue mentioned above, that is, for "regular" signal we wish small details. By measuring the modulus of continuity of the original function through Lf p,V , we get the following statement corresponding to Prop. 15 in [3]: Theorem 19 (Control on the size of details). For any p ≥ 1 and any f ∈ p (V, µ), The next result gives a control on the size of the coefficients at arbitrary scale i ≤ N when implementing the multiresolution scheme with N ≥ 1 successive reductions. To this end, let us introduce suitable abbreviations for the objects at the different scales i = 1, . . . , N . We have a given sequence of N (non-empty) nested vertex sets For each i = 0, . . . , N − 1 set: • w i , β i and γ i as in (4), (41) and (54), respectively, w.r.t. L i .
• f 0 = f , and the succesive approximation and detail components f i+1 = K i f i and g i+1 = (K i − IdV i )f i as in (44) and (45), respectively, so that We can now extend the analysis operator in (49) to arbitrary scale N ≥ 1 by setting Here is the control on the vectors identified by U N we derived in Prop. 17 in [3].
Theorem 20 (Bound on the norm of the analysis operator). Let p ≥ 1 and p * be its conjugate exponent. For any N ≥ 1 and f ∈ p (V, µ), Our last result is a form of so-called Jackson's inequality. This result is in general crucial for numerical stability in approximation theory and multiresolution analysis (and in particular it plays an important role in our approach to tune the involved parameters, see section 6 in [3]). It guarantees small error for "smooth" functions, when reconstructing an approximating function of the original one after setting the details g i 's to zero at all scales. This is clearly relevant if e.g. the aim of the multiresolution is to compress a signal. To formulate it, notice that by performing N reduction steps in our multiresolution, from the coefficients [f N , g N , g N −1 , · · · , g 1 ] we can reconstruct f = f 0 as follows: The approximation of f associated to scale N is thus the function on V: and we have the following Jackson's inequality measuring its quality.
Theorem 21 (Jackson's inequality: quality of the approximation of a signal). For any p ≥ 1 and any f ∈ p (V, µ), letf (N ) be the approximation associated to scale N ≥ 1 in (56). Then For the proof of this statement see Prop. 18 in [3]. It is worth stressing that the second summand involving f p,V is due to the propagation of the error in the intertwining relation, it would vanish if the generators at all scales were perfectly intertwined. 6.4. Choosing the downsampling and concentration parameters. Intertwining error and localization property have to be our guidelines to choose our downsampling parameter q and our concentration parameter q . Beyond squeezing measurement, another way to look at localization properties is to focus on the reconstruction operator norm: spread out wavelets φx and scaling functions ψx will lead to bad reconstruction properties, i.e., to a large operator norm. These norms are controlled by Equations (52) and (53). Since theR operators only are composed together in the reconstruction scheme, Equation (52) is the crucial inequality and we look at it for p = +∞. As far as the intertwining error is concerned we look at Equation (40), for p = +∞ too. These inequalities say that we want q /β andw max /q as small as possible. This implies that we want w max /β as small as possible. Note that this is a random function of q. It does not depend on q , but it is very costly to estimate it by direct Monte-Carlo methods. Now the same kind of algebra we used to prove that the mean hitting time of the root set does not depend on the starting point, offers a workaround. We can estimatew max with  It turns out that these expected values are respectively equal tõ These expected values are easy to estimate by Monte-Carlo simulations for q between two bounds of order w max -such a restriction is natural if we expect |V| to be a fraction of |V|-since we have a practical algorithm to sample all the Φ q together. We can then choose q by optimization between the two bounds. We refer to section 6 in [3] for more details. It remains to choose q once we have chosen q and sampled Φ q . It turns out that the previous estimations onw max and 1/β suggest that the norm of the composedR could be bounded by |V| by choosingw max /q and q /β of the same order. This is actually ensured by setting q = 2w max |R(Φ q )| |V \ R(Φ q )| .
While ensuring numerical stability of the algorithm, this, in turn, essentially amounts to make intertwining error and localization property of the same importance. We refer once again to section 6 in [3] for details and we conclude this survey with giving some numerical experiments based on these results.
6.5.1. Comparison with classical wavelet algorithms on the torus. In Figure 10, we show two steps of the multiresolution analysis for intertwining wavelets and Daubechies-12 wavelets applied to the signal shown on Figure 9. We refer to [3] for the description of the natural compression algorithm that is associated with our multiresolution scheme. Figure 11 shows the relative compression error in terms of the percentage of kept detail coefficients. In Figure 12, we compare the original signal with the compressed one we obtained by keeping 10% of the detail coefficients. In Figures 13 and 15, we show compressed images of a rectangle and a cameraman, respectively, obtained by keeping the same number of coefficients. Figures 14 and 16 show the upsampled approximations (obtained by setting to zero all detail coefficients). Figure 10. Two steps of the multiresolution for the intertwining wavelets (first graph) and Daubechies-12 wavelets. The three parts of a graph give (g 1 , g 2 , f 2 ). Figure 11. Relative compression error of signal in Figure 9 in terms of the percentage of kept normalized detail coefficients. In red, the error using intertwining wavelets. In blue, error using Daubechies12 wavelets. 6.5.2. A last example. We took from [38] and the GSP toolbox [34] the sensor graph and the signal represented in Figure 17 together with two steps of the multiscale analysis. In Figure 18 we compare the results of the intertwining compression algorithm with those of the spectral graph wavelet pyramidal algorithm. Unless specified, in all the previous experiments we always included our sparsification procedure. In this last figure we added the result of the algorithm without sparsification. It is worth to note that, in this example at least, sparsification helps, not only for algorithmic complexity.      Without sparsification With sparsification Figure 18. Relative compression error of signal in Figure 17(a), in terms of the number of kept coefficients. In red, the error using intertwining wavelets. In blue, error using the spectral graph wavelets pyramidal algorithm. In the first graph, we do not sparsify the graph, while we perform sparsification on the second.