Deep composition of tensor-trains using squared inverse Rosenblatt transports

Characterising intractable high-dimensional random variables is one of the fundamental challenges in stochastic computation. The recent surge of transport maps offers a mathematical foundation and new insights for tackling this challenge by coupling intractable random variables with tractable reference random variables. This paper generalises the functional tensor-train approximation of the inverse Rosenblatt transport recently developed by Dolgov et al. (Stat Comput 30:603--625, 2020) to a wide class of high-dimensional non-negative functions, such as unnormalised probability density functions. First, we extend the inverse Rosenblatt transform to enable the transport to general reference measures other than the uniform measure. We develop an efficient procedure to compute this transport from a squared tensor-train decomposition which preserves the monotonicity. More crucially, we integrate the proposed order-preserving functional tensor-train transport into a nested variable transformation framework inspired by the layered structure of deep neural networks. The resulting deep inverse Rosenblatt transport significantly expands the capability of tensor approximations and transport maps to random variables with complicated nonlinear interactions and concentrated density functions. We demonstrate the efficiency of the proposed approach on a range of applications in statistical learning and uncertainty quantification, including parameter estimation for dynamical systems and inverse problems constrained by partial differential equations.


Introduction
Exploration of high-dimensional probability distributions is a fundamental task in statistical physics, machine learning, uncertainty quantification, econometrics, and beyond. In many practical scenarios, high-dimensional random variables of interest follow intractable probability measures that exhibit nonlinear interactions and concentrate near sub-manifolds. Consequently, one cannot directly simulate the random variables of interest, but one may be able to evaluate the unnormalised density function pointwise.
Suppose we have an intractable target probability measure ν_π with an unnormalised density function π(x) over a parameter space X ⊆ R^d, for example, the posterior measure in a Bayesian inference problem. Various approaches have been proposed to characterise ν_π using some reference probability measure µ defined over U ⊆ R^d from which independent and identically distributed random variables can be drawn, e.g., a uniform or a Gaussian measure. For example, Markov chain Monte Carlo (MCMC) methods [36,53] generate a Markov chain of random variables converging to ν_π using µ as the proposal; and importance sampling and sequential Monte Carlo [32,48] characterise ν_π using weighted samples drawn from µ. The recently developed transport map idea, e.g., [4,17,39,43,50], offers new insights for this task by identifying a measurable mapping, T : U → X, such that the pushforward of µ, denoted by T♯µ, is a close approximation to ν_π. The mapping T can then be used either to accelerate classical sampling methods such as MCMC or to improve the efficiency of importance sampling. In this work, we generalise the tensor-train (TT) approach of [17] to offer an order-preserving and multi-layered construction of transport maps that is suitable for high-dimensional random variables with nonlinear interactions and concentrated density functions.

Outline and contributions
The TT-based construction of [17] realises the mapping T via a separable TT decomposition [47] of the target density function. Since the separable tensor decomposition enables the marginalisation of the target density at a computational cost that scales linearly in the dimension of the random variables, it offers a computationally viable way of approximating the marginal and conditional density functions of the target measure. In turn, the cumulative distribution functions (CDFs) of the marginals and conditionals define the Rosenblatt transport, which can couple the target measure with the uniform reference measure. Section 2 presents the relevant background on the Rosenblatt transport, the functional form of the TT decomposition of multivariate functions [2,25,26,28], and the TT-based construction of the inverse Rosenblatt transport.
The TT-based construction faces several challenges. First, the TT decomposition of the non-negative target density function often cannot preserve non-negativity after rank truncations. Consequently, the resulting Rosenblatt transport may not be monotone. Second, the TT decomposition works best when the correlations between random variables are local, i.e., when the correlation decays with the distance between the indices of the variables. In the extreme case of independent random variables, the joint density factorises into the product of marginal densities. However, high-dimensional random variables of interest often have concentrated density functions and exhibit complicated nonlinear interactions. In such cases, one may need a TT with high ranks to approximate the target probability density with sufficient accuracy, which in turn requires a rather large number of target density evaluations during the TT construction.
In Section 3, we overcome the first challenge by proposing a new construction of the inverse Rosenblatt transport: we approximate the square root of the target density in the TT format and then construct the marginal and conditional densities from the square of the TT approximation. The resulting squared inverse Rosenblatt transport (SIRT) is order- and smoothness-preserving. In addition, utilising the squared structure of the approximation, we also establish error bounds for SIRT in terms of various statistical divergences within the f-divergence family. These bounds are useful for bounding the statistical efficiency of posterior characterisation algorithms such as MCMC and importance sampling.

Fig. 1 Illustration of DIRT. Top row shows a sequence of bridging measures towards the target measure ν_π. Each layer of DIRT identifies an incremental mapping that couples the reference measure (bottom row) with the pullback of the bridging measure under the existing composition of mappings (middle row), which admits a simpler structure for constructing the TT decomposition.
In Section 4, we circumvent the second challenge by introducing a multi-layered deep inverse Rosenblatt transport (DIRT) that builds a composition of SIRTs guided by a sequence of bridging measures of increasing complexity. We illustrate this idea in Figure 1. At each layer of DIRT, we aim to obtain a composition of SIRTs, denoted by T_0 ∘ T_1 ∘ ··· ∘ T_k, such that the pushforward of the reference measure under this composition is a close approximation of the k-th bridging measure ν_k. The existing composition T_0 ∘ T_1 ∘ ··· ∘ T_k offers a nonlinear transformation of coordinates that can effectively capture the correlations and the support of the next bridging measure ν_{k+1}. As a result, the density of the pullback measure, (T_0 ∘ T_1 ∘ ··· ∘ T_k)♯ ν_{k+1}, can have a much simpler structure for building the TT decomposition than the density of ν_{k+1} itself. We can then factorise the density of (T_0 ∘ T_1 ∘ ··· ∘ T_k)♯ ν_{k+1} to define the incremental mapping T_{k+1} such that (T_{k+1})♯ µ = (T_0 ∘ T_1 ∘ ··· ∘ T_k)♯ ν_{k+1}. This way, DIRT is capable of characterising random variables with concentrated density functions by factorising a sequence of less complicated density functions in transformed coordinates. To further improve the efficiency, we also present strategies to embed general reference measures, rather than the uniform reference measure, to avoid complicated boundary layers during the DIRT construction. Moreover, we show that the DIRT construction is robust to TT approximation errors in various statistical divergences, in the sense that the error bound on a range of divergences is a linear combination of the errors of the TT decompositions involved in the DIRT construction.
In Section 5, we integrate SIRT and DIRT into existing MCMC and importance sampling methods to further reduce the estimation and sampling bias due to TT approximation errors. In Section 6, we demonstrate the efficiency and various aspects of DIRT on several Bayesian inverse problems governed by ordinary differential equations (ODEs) and partial differential equations (PDEs). Using a predator-prey dynamical system (Section 6.1), we benchmark the impact of various tuning parameters of the functional TT decomposition, such as the TT rank, the number of collocation points, and the choice of the reference measure, on the accuracy of DIRT. Using an elliptic PDE (Section 6.3), we compare the single-layered SIRT with DIRT, where DIRT shows a clear advantage in both computational efficiency and accuracy over its single-layered counterpart. In the same example, we also demonstrate the efficiency of TT with the Fourier basis, compared to the piecewise-linear basis, on posterior measures that concentrate due to an increasing number of measurements and decreasing measurement noise. Furthermore, we can vary the discretisation of the underlying ODE or PDE models from layer to layer to accelerate the DIRT construction. For an example involving a computationally expensive parabolic PDE (Section 6.4), we employ models with increasingly refined grids to construct a DIRT that would otherwise be computationally infeasible to build.

Related work
Apart from building transport maps using TT decompositions, most other methods approximate the transport map T by solving an optimisation problem such that T minimises some statistical divergence between the target ν_π and the pushforward T♯µ. The mapping T often has a triangular structure, which makes evaluating the Jacobian and the inverse of T computationally efficient, and can be represented using polynomials [4,43,50,51], kernel functions [15,37], invertible neural networks [7,9,10,35,49,52], etc. In this setting, the objective function has to be approximated using a Monte Carlo average and minimised by some (stochastic) gradient-based method. Depending on the objective function and how samples are obtained, the resulting methods may have very different structures.
-Density approximation. The work of [4,43,51] aims to minimise the Kullback-Leibler (KL) divergence of the pushforward T µ from the target ν π , in which the pushforward density naturally approximates the target density. In this case, the KL divergence is approximated using the Jacobian of T and the target density function evaluated at samples drawn from the analytically tractable reference measure. The resulting optimisation problem may be highly nonlinear and non-convex.
In each optimisation iteration, the target density function has to be re-evaluated as the reference samples are transported by the updated map. Our TT-based methods also rely on approximations to the target density. However, TT approximations employ highly efficient deterministic sampling algorithms such as TT-Cross [46], which are free from both gradients and Monte Carlo sampling. This way, TT-based methods may need fewer density evaluations to accurately approximate the target density.
-Density estimation. The strategy adopted by normalising flows (e.g., [7,9,10,35,49,52]) and the work of [50,61,63] offers an alternative that can bypass evaluations of the target density. Instead, these methods assume the availability of samples drawn from the target measure and construct objective functions using a given set of target samples. Many of these methods, particularly neural networks, were originally designed to approximate high-dimensional distributions of naturally available samples, such as images. However, in our context, the intractable target random variable X cannot be simulated directly. One has to assume that there exists an auxiliary random variable Y such that the density function of X is given by a conditional density π(x|y) and the pair of joint random variables (X, Y) can be simulated directly. This way, density estimation can be employed to first build a mapping from some higher-dimensional reference measure to the joint measure of (X, Y), and then to obtain the mapping T by conditioning on a particular realisation Y = y. We provide a concrete example of normalising flows and its comparison with DIRT in Section 6.1.
-Greedy methods. In between the fully data-driven density estimation and the function-driven density approximation is the greedy strategy, including the Stein variational gradient descent method [37], its Newton variant [15], and the lazy maps [4]. While greedy methods build transport maps sharing a similar composition structure with DIRT, they obtain the composition of mappings by iteratively minimising the KL divergence of the pushforward measure under the current composition of maps from the target ν_π. To ease the optimisation burden, the class of mappings used in each layer of greedy methods is often restricted, for example, to reproducing kernel Hilbert spaces with Gaussian kernels [15,37] or sparse low-order polynomials [4].
As a result, greedy methods often need a rather large number of layers to accurately approximate concentrated target densities, and hence may incur a large number of computationally costly target density evaluations. Compared to the greedy strategy, the use of bridging measures allows DIRT to construct the TT decompositions in different layers with arbitrary accuracy. The total error in DIRT also accumulates linearly with the number of layers. We provide a numerical comparison of the performance of DIRT and the Stein variational Newton method [15] in Section 6.1.

Background
In this section, we first introduce some notation and assumptions used throughout the paper. Then, we review the inverse Rosenblatt transport method that offers an algebraically exact transformation from the reference measure to the target measure. We will also discuss the role of the functional TT decomposition in the numerical construction of the (approximate) inverse Rosenblatt transport.

Notation
We consider probability measures that are absolutely continuous with respect to the Lebesgue measure. Suppose a mapping S : X → U is a diffeomorphism and a probability measure ν has a density p(x). The pushforward of ν under S, denoted by S♯ν, has the density

  u ↦ p(S^{-1}(u)) |∇S^{-1}(u)|,

where |∇S^{-1}(u)| denotes the absolute value of the Jacobian determinant of S^{-1}. Similarly, given a probability measure λ with a density q(u), the pullback of λ under S, denoted by S♯λ, has the density

  x ↦ q(S(x)) |∇S(x)|.

The shorthand X ∼ ν is used to refer to a random variable X with the law ν. For a ν-integrable function q : X → R, the expectation of q is denoted by ν(q) = ∫_X q(x) ν(dx).
We assume the parameter space X and the reference space U can be expressed as Cartesian products X = X_1 × X_2 × ··· × X_d and U = U_1 × U_2 × ··· × U_d, respectively, where X_k ⊆ R and U_k = [0, 1]. Using product-form Lebesgue-measurable weighting functions λ(x) = λ_1(x_1) ··· λ_d(x_d), the weighted L^p norms on X can be expressed as

  ||h||_{L^p_λ(X)} = ( ∫_X |h(x)|^p λ(x) dx )^{1/p},

and analogously on U. For a vector x ∈ R^d and an index k ∈ N such that 1 < k < d, we express the first k−1 coordinates and the last d−k coordinates of x as

  x_{<k} = (x_1, ..., x_{k−1})  and  x_{>k} = (x_{k+1}, ..., x_d),

respectively. Similarly, we write x_{≤k} = (x_{<k}, x_k), x_{≥k} = (x_k, x_{>k}), x_{≤1} = x_1, and x_{≥d} = x_d. For any non-negative function π ∈ L^1_λ(X), we define its marginal functions as

  π_{≤k}(x_{≤k}) = ∫ π(x_{≤k}, x_{>k}) λ_{k+1}(x_{k+1}) ··· λ_d(x_d) dx_{>k},

with π_{≤d}(x_{≤d}) = π(x). The marginal functions should not be confused with π_1(x), π_2(x), ..., π_k(x), where the subscript indexes a sequence of functions on X.

Inverse Rosenblatt transport
We start with a d-dimensional uniform reference probability measure, µ_uni, defined on the unit hypercube U = [0, 1]^d, which has the probability density function (PDF) f_uni(u) = 1. We aim to characterise a target probability measure ν_π with the PDF

  f_X(x) = (1/z) π(x) λ(x),  z = ∫_X π(x) λ(x) dx.

Here, π ∈ L^1_λ(X) is the unnormalised density function (with respect to the weight λ) that is non-negative, i.e., π(x) ≥ 0 for all x ∈ X, and z is the normalising constant, which is often unknown.
Let X := (X_1, ..., X_d) be the target d-dimensional random variable with law ν_π and U be the reference d-dimensional random variable with law µ_uni. The Rosenblatt transport offers a viable way of constructing a map F : R^d → R^d such that F(X) = U. As explained in [6,58,64], the principle of the Rosenblatt transport is the following. For 1 ≤ k ≤ d, we denote the marginal PDF of the k-dimensional random variable X_{≤k} := (X_1, ..., X_k) by

  f_{X_{≤k}}(x_{≤k}) = (1/z) π_{≤k}(x_{≤k}) λ_1(x_1) ··· λ_k(x_k),

and the PDF of the conditional random variable X_k | X_{<k} by

  f_{X_k|X_{<k}}(x_k | x_{<k}) = f_{X_{≤k}}(x_{≤k}) / f_{X_{<k}}(x_{<k}).

This way, the CDF of X_1 and the conditional CDF of X_k | X_{<k} can be expressed as

  F_{X_1}(x_1) = ∫_{−∞}^{x_1} f_{X_1}(t) dt  and  F_{X_k|X_{<k}}(x_k | x_{<k}) = ∫_{−∞}^{x_k} f_{X_k|X_{<k}}(t | x_{<k}) dt,

respectively. Under mild assumptions [6], the sequence of transformations

  u_1 = F_{X_1}(x_1),  u_2 = F_{X_2|X_1}(x_2 | x_1),  ...,  u_d = F_{X_d|X_{<d}}(x_d | x_{<d})
defines a unique, monotonically increasing map F : X → U of the form

  F(x) = ( F_{X_1}(x_1), F_{X_2|X_1}(x_2 | x_1), ..., F_{X_d|X_{<d}}(x_d | x_{<d}) ),

such that the random variable U = F(X) is uniformly distributed in the unit hypercube [0, 1]^d. Since the k-th component of F is a scalar-valued function depending on only the first k variables, that is, F_{X_k|X_{<k}} : R^k → R, the map F has a lower-triangular form. Furthermore, the map F (as well as its inverse) is almost surely differentiable and satisfies F♯ν_π = µ_uni. Suppose one can compute the Rosenblatt transport. Then, it provides a viable way of characterising the target measure: one can first generate a uniform random variable U ∼ µ_uni and then apply the inverse Rosenblatt transport (IRT) to obtain a corresponding target random variable X ∼ ν_π. The inverse Rosenblatt transport T ≡ F^{−1} : U → X is also lower-triangular and can be constructed by successively inverting the Rosenblatt transport for k = 1, ..., d:

  x_k = F^{−1}_{X_k|X_{<k}}(u_k | x_{<k}).

The evaluation of each F^{−1}_{X_k|X_{<k}}(u_k | x_{<k}) requires inverting only a scalar-valued monotone function u_k = F_{X_k|X_{<k}}(x_k | x_{<k}), where x_{<k} is already determined in the first k−1 steps. Using the change-of-variables formula, the expectation of a function h : X → R can be expressed as

  ν_π(h) = ∫_X h(x) f_X(x) dx = ∫_U h(T(u)) du.

This way, the expectation over the intractable target probability measure can be expressed as an expectation over the reference uniform probability measure, and thus many efficient high-dimensional quadrature methods, such as sparse grids [5] and quasi-Monte Carlo [16], may apply.
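For intuition, the sequential inversion above can be sketched in two dimensions, with brute-force quadrature on a grid standing in for the TT machinery introduced later. The function name, the grid, and the trapezoidal CDFs are all our illustrative choices, not the paper's implementation:

```python
import numpy as np

def inverse_rosenblatt_2d(density, grid, u):
    """Map u in [0,1]^2 to a sample of the target: invert the marginal CDF of
    x1, then the conditional CDF of x2 given x1 (illustrative sketch only)."""
    dx = np.diff(grid)
    X1, X2 = np.meshgrid(grid, grid, indexing="ij")
    p = density(X1, X2)                                 # joint density on the grid
    # marginal of x1 by trapezoidal quadrature over x2
    p1 = (0.5 * (p[:, 1:] + p[:, :-1]) * dx).sum(axis=1)
    F1 = np.concatenate([[0.0], np.cumsum(0.5 * (p1[1:] + p1[:-1]) * dx)])
    F1 /= F1[-1]                                        # marginal CDF of x1
    x1 = np.interp(u[0], F1, grid)                      # invert the monotone CDF
    p2 = density(x1, grid)                              # unnormalised conditional of x2 | x1
    F2 = np.concatenate([[0.0], np.cumsum(0.5 * (p2[1:] + p2[:-1]) * dx)])
    F2 /= F2[-1]                                        # conditional CDF of x2 | x1
    x2 = np.interp(u[1], F2, grid)
    return np.array([x1, x2])
```

For a standard Gaussian target, u = (0.5, 0.5) maps to the origin, and u_1 = 0.975 maps the first coordinate close to the 97.5% Gaussian quantile, illustrating that each component is a one-dimensional monotone inversion.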

Functional tensor-train
For high-dimensional target measures, it may not be computationally feasible to compute the marginal densities π_{≤k}, and hence the marginal and conditional CDFs in (5), for building the inverse Rosenblatt transport. To overcome this challenge, a recent work [17] employed the TT decomposition [47] to factorise the density of the target measure in a separable form, which leads to a computationally scalable method for building the inverse Rosenblatt transport. Here we first discuss the basics of the TT decomposition of a multivariate function.
Since multivariate functions can be viewed as continuous analogues of tensors [28], one can factorise the unnormalised density function using the functional form of TT [2,25,26]. Given a multivariate function h : X → R, its TT decomposition takes the form

  h̃(x) = Σ_{α_1=1}^{r_1} ··· Σ_{α_{d−1}=1}^{r_{d−1}} H_1^{(α_0,α_1)}(x_1) H_2^{(α_1,α_2)}(x_2) ··· H_d^{(α_{d−1},α_d)}(x_d),

with r_0 = r_d = 1, where the summation ranges r_0, r_1, ..., r_d are called TT ranks. Each univariate function H_k^{(α_{k−1},α_k)} : X_k → R is represented as a linear combination of a set of n_k basis functions {φ_k^{(i)}}_{i=1}^{n_k},

  H_k^{(α_{k−1},α_k)}(x_k) = Σ_{i=1}^{n_k} A_k(α_{k−1}, i, α_k) φ_k^{(i)}(x_k),

where A_k ∈ R^{r_{k−1} × n_k × r_k} is a coefficient tensor. Examples of the basis functions include piecewise polynomials, orthogonal functions, radial basis functions, etc. In general, the TT decomposition h̃(x) is only an approximation to the original function h(x) because of the truncated TT ranks and the finite sets of basis functions used for representing each H_k^{(α_{k−1},α_k)}.

Remark 1 For each k, grouping all the univariate functions H_k^{(α_{k−1},α_k)}(x_k), we have a matrix-valued function H_k(x_k) : X_k → R^{r_{k−1} × r_k} that is commonly referred to as the k-th TT core. This way, the TT decomposition can also be expressed in the matrix form

  h̃(x) = H_1(x_1) H_2(x_2) ··· H_d(x_d).

We follow the MATLAB notation to denote the vector-valued functions consisting of the α_k-th column and the α_{k−1}-th row of H_k(x_k) by H_k^{(:,α_k)}(x_k) and H_k^{(α_{k−1},:)}(x_k), respectively. In some situations, it is convenient to represent the TT decomposition with grouped coordinates. For example, we can write TT as the functional analogue of the compact singular value decomposition (SVD):

  h̃(x) = H_{<k}(x_{<k}) H_{≥k}(x_{≥k}),

where H_{<k}(x_{<k}) = H_1(x_1) ··· H_{k−1}(x_{k−1}) and H_{≥k}(x_{≥k}) = H_k(x_k) ··· H_d(x_d).

Given a multivariate function, its TT decomposition can be computed using alternating linear schemes such as the classical alternating least squares method (e.g., [34,46]), density matrix renormalization group methods [30,47,65], and the alternating minimal energy method [18], together with the cross approximation [22,23,24,38] or the empirical interpolation [1,8]. In Appendix 8.1, we detail the cross algorithm used for constructing the functional TT decomposition.
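To make the matrix form of Remark 1 concrete, the following sketch (our illustration, not the paper's code) evaluates a functional TT h̃(x) = H_1(x_1)···H_d(x_d) from coefficient tensors A_k of shape (r_{k−1}, n_k, r_k) and a shared univariate basis:

```python
import numpy as np

def tt_eval(x, cores, basis):
    """Evaluate a functional TT at the point x = (x_1, ..., x_d).
    `cores[k]` is a coefficient tensor A_k of shape (r_{k-1}, n_k, r_k);
    `basis(x_k)` returns the n_k basis function values at x_k."""
    v = np.ones((1, 1))                                  # r_0 = 1
    for xk, A in zip(x, cores):
        Hk = np.einsum("inj,n->ij", A, basis(xk))        # the TT core H_k(x_k)
        v = v @ Hk                                       # accumulate the matrix product
    return v[0, 0]                                       # r_d = 1
```

With the monomial basis {1, x}, the rank-1 cores A_k = [[0], [1]] recover h(x) = x_1 x_2, while a rank-2 pair recovers h(x) = x_1 + x_2, illustrating how the TT ranks encode the interaction structure.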

A TT-based inverse Rosenblatt transport
Using the functional TT previously discussed, we now review the TT-based construction of the inverse Rosenblatt transport [17]. Suppose one has the (approximate) TT decomposition π̃(x) of the unnormalised target density π(x) in the form of

  π̃(x) = F_1(x_1) F_2(x_2) ··· F_d(x_d),

where F_k(x_k) : X_k → R^{r_{k−1} × r_k} is the k-th TT core. Then, we can approximate the target PDF by

  f̃_X(x) = (1/c̃) π̃(x) λ(x),  c̃ = ∫_X π̃(x) λ(x) dx.

Proposition 1 For k < d, the k-th marginal PDF is given by

  f̃_{X_{≤k}}(x_{≤k}) = (1/c̃) F_1(x_1) ··· F_k(x_k) F̄_{k+1} ··· F̄_d λ_1(x_1) ··· λ_k(x_k),  where  F̄_j = ∫_{X_j} F_j(x_j) λ_j(x_j) dx_j.

Proof The marginal function of π̃(x) can be expressed by

  π̃_{≤k}(x_{≤k}) = ∫ π̃(x_{≤k}, x_{>k}) λ_{k+1}(x_{k+1}) ··· λ_d(x_d) dx_{>k}.

Using the separable form of the tensor-train, the marginal function then has the form

  π̃_{≤k}(x_{≤k}) = F_1(x_1) ··· F_k(x_k) ( ∫_{X_{k+1}} F_{k+1}(x_{k+1}) λ_{k+1}(x_{k+1}) dx_{k+1} ) ··· ( ∫_{X_d} F_d(x_d) λ_d(x_d) dx_d ).

Since c̃ = ∫_X π̃(x) λ(x) dx, we have c̃ = F̄_1 ··· F̄_d by a similar argument.
The above proposition leads to the marginal PDF f̃_{X_1}(x_1) = (1/c̃) π̃_{≤1}(x_1) λ_1(x_1) and the sequence of conditional probability densities

  f̃_{X_k|X_{<k}}(x_k | x_{<k}) = π̃_{≤k}(x_{≤k}) λ_k(x_k) / π̃_{<k}(x_{<k}),  k = 2, ..., d.

This leads to the CDF

  F̃_{X_1}(x_1) = ∫_{−∞}^{x_1} f̃_{X_1}(t) dt

and the sequence of conditional CDFs

  F̃_{X_k|X_{<k}}(x_k | x_{<k}) = ∫_{−∞}^{x_k} f̃_{X_k|X_{<k}}(t | x_{<k}) dt

for k = 2, ..., d, and hence the Rosenblatt transport U = F̃(X). This equivalently defines the inverse Rosenblatt transport T̃ = F̃^{−1}. This way, by drawing a reference random variable U ∼ µ_uni and evaluating X = T̃(U), we obtain an approximate target random variable X ∼ T̃♯µ_uni. Note that the pushforward measure T̃♯µ_uni has the density f̃_X(x).
To estimate the numerical complexity, let us introduce the maximal number of basis functions n = max_{k=1,...,d} n_k and the maximal TT rank r = max_{k=0,...,d} r_k, and suppose we need to draw N samples from π̃(x). Note that we can precompute the products F̄_{k+1} ··· F̄_d for all k, with a total cost of O(dnr²) operations, before any sampling starts. Similarly, the conditioning requires the interpolation of F_1(x_1) ··· F_{k−1}(x_{k−1}) at the current sample coordinates, which can be built up sequentially. Each univariate interpolation needs O(nr²) operations in general, but for a piecewise interpolation this can be reduced to O(r²) operations per sample per coordinate. Finally, assembling the conditional density requires multiplying the resulting prefix vector with the core F_k(x_k), contracted with the precomputed suffix vector, which costs O(nr) operations per sample per coordinate. The total complexity is therefore O(dnr² + Ndr² + Ndnr) [17].
Constructing the inverse Rosenblatt transport using the TT decomposition of the target density faces several challenges. First, the density function π(x) is non-negative; however, its truncated TT decomposition π̃(x) can have negative values. A discrete analogue is that the truncated SVD of a matrix with non-negative entries can have negative entries. This leads to a critical issue: if the set {x ∈ X | π̃(x) < 0} has nonzero measure under ν_π, then the Rosenblatt transport constructed from π̃(x) loses monotonicity. A simple way to circumvent this is to take the modulus of each univariate conditional density f̃_{X_k|X_{<k}}(x_k | x_{<k}) and then renormalise the modulus before computing the CDF [17]. However, the use of moduli and renormalisations may degrade the smoothness of the marginal and conditional PDFs. As a result, the inverse Rosenblatt transport and its induced PDF can lose accuracy and smoothness. More importantly, the construction of the TT decomposition (see Section 8.1 for details) requires evaluating the target density at parameter points where the target density is significant. In practice, the high-probability region of a high-dimensional target density, e.g., the posterior in the Bayesian inference context, can be hard to characterise. Thus, it can be challenging to construct the TT decomposition for approximating the target density directly. In the next section, we generalise the TT-based construction of the inverse Rosenblatt transport by tackling the aforementioned challenges.
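The negativity caused by truncation can be seen in a one-dimensional analogue (our illustration): the partial Fourier sum of the non-negative box function dips below zero near the jumps, the Gibbs phenomenon, so the truncated expansion cannot serve as a valid density:

```python
import numpy as np

# Truncating an exact expansion of a non-negative function can produce
# negative values: keep only 10 Fourier modes of the box function and the
# partial sum undershoots below zero near the jumps, analogous to how a
# rank-truncated TT decomposition of a non-negative density can go negative.
x = np.linspace(-np.pi, np.pi, 4001)
box = (np.abs(x) < np.pi / 2).astype(float)            # non-negative target
partial_sum = 0.5 * np.ones_like(x)                    # a_0 = 1/2
for m in range(1, 11):                                 # truncated expansion
    partial_sum += (2.0 / (m * np.pi)) * np.sin(m * np.pi / 2) * np.cos(m * x)
```

The partial sum is accurate in the bulk (close to 1 at the centre of the box) yet attains negative values, which is exactly the failure mode that the squared construction of the next section removes.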

Squared inverse Rosenblatt transport
We first introduce the SIRT to overcome the negativity issue outlined above. Instead of directly decomposing the unnormalised target density π(x), we first obtain the (approximate) functional TT decomposition ĝ(x) of the square root of π(x) in the form of

  ĝ(x) = G_1(x_1) G_2(x_2) ··· G_d(x_d),

where G_k(x_k) : X_k → R^{r_{k−1} × r_k} is the k-th TT core. This leads to an alternative approximation to the target PDF:

  f̂_X(x) = (1/ẑ) π̂(x) λ(x),  π̂(x) = ĝ(x)² + γ,  ẑ = ∫_X π̂(x) λ(x) dx,

where γ > 0 is a constant chosen according to the L² error of ĝ(x). Similar to the process discussed in Section 2.4, we can obtain the SIRT, X̂ = T̂(U), by constructing the sequence of marginal functions π̂_{≤k}(x_{≤k}) for k = 1, ..., d − 1 and computing the normalising constant ẑ. Given a reference random variable U ∼ µ_uni, we can evaluate X̂ = T̂(U) to obtain an approximate target random variable X̂ ∼ T̂♯µ_uni, which has exactly the PDF f̂_X(x). Since the function π̂(x) is positive by construction, we preserve the smoothness and monotonicity of the resulting SIRT T̂.
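A minimal one-dimensional sketch of the squaring idea, with a truncated Chebyshev fit standing in for the TT decomposition of the text (the variable names and the value of the defensive constant γ are our illustrative choices):

```python
import numpy as np

# Approximate the *square root* of the density, then square the
# approximation and add a small defensive constant gamma. The surrogate is
# bounded below by gamma > 0 by construction, so the induced CDF is
# strictly monotone, unlike a direct truncated approximation of the density.
x = np.linspace(-1.0, 1.0, 1001)
target = np.exp(-8.0 * x**2)                           # unnormalised density
coeffs = np.polynomial.chebyshev.chebfit(x, np.sqrt(target), deg=6)
g = np.polynomial.chebyshev.chebval(x, coeffs)         # approximates sqrt(target)
gamma = 1e-3                                           # defensive constant
pi_hat = g**2 + gamma                                  # positive surrogate density
```

Even if the fitted g oscillates and changes sign, the squared surrogate stays strictly positive, which is the property that the SIRT construction exploits.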
Remark 2 For a target density π(x) satisfying sup_{x∈X} π(x) < ∞, the ratio between π(x) and the approximate density π̂(x) satisfies

  sup_{x∈X} π(x)/π̂(x) < ∞,

since π̂(x) ≥ γ > 0 by construction. This bound is essential to ensure the uniform ergodicity of the independence Metropolis algorithm and the rate of convergence of importance sampling schemes defined by SIRT. See Section 5 for further details.

Marginal functions and conditional PDFs
We represent each TT core of the decomposition in (18) as

  G_k^{(α_{k−1},α_k)}(x_k) = Σ_{i=1}^{n_k} A_k(α_{k−1}, i, α_k) φ_k^{(i)}(x_k),

where {φ_k^{(i)}}_{i=1}^{n_k} is the set of basis functions for the k-th coordinate and A_k ∈ R^{r_{k−1} × n_k × r_k} is the associated k-th coefficient tensor. For the k-th set of basis functions, we define the mass matrix M_k ∈ R^{n_k × n_k} by

  M_k(i, j) = ∫_{X_k} φ_k^{(i)}(x_k) φ_k^{(j)}(x_k) λ_k(x_k) dx_k.

Then, given coefficient tensors B_k ∈ R^{r_{k−1} × n_k × r_k} (recursively defined below), we define the matrix-valued functions

  B_k(x_k)(α_{k−1}, α_k) = Σ_{i=1}^{n_k} B_k(α_{k−1}, i, α_k) φ_k^{(i)}(x_k).   (23)

Omitting the defensive constant γ for brevity, we can represent the marginal functions by

  π̂_{≤k}(x_{≤k}) = || G_1(x_1) ··· G_{k−1}(x_{k−1}) B_k(x_k) ||²_2,   (24)

where α_0 = 1 and the empty product for k = 1 is taken to be 1, for a coefficient tensor B_k ∈ R^{r_{k−1} × n_k × r_k} that is recursively defined as follows.
Proposition 2 Starting with the last coordinate k = d, we set B_d = A_d. Suppose for the first k dimensions (k > 1), we have a coefficient tensor B_k ∈ R^{r_{k−1} × n_k × r_k} that defines a marginal function π̂_{≤k}(x_{≤k}) as in (24). The following procedure can be used to obtain the coefficient tensor B_{k−1} ∈ R^{r_{k−2} × n_{k−1} × r_{k−1}} for defining the next marginal function π̂_{<k}(x_{<k}):
1. Use the Cholesky decomposition of the mass matrix, L_k L_k^T = M_k ∈ R^{n_k × n_k}, to construct a tensor C_k ∈ R^{r_{k−1} × n_k × r_k}:

  C_k(α_{k−1}, i, α_k) = Σ_{j=1}^{n_k} B_k(α_{k−1}, j, α_k) L_k(j, i).

2. Unfold C_k along the first coordinate [34] to obtain a matrix C_k^{(R)} ∈ R^{r_{k−1} × (n_k r_k)} and compute the thin QR decomposition

  Q_k R_k = (C_k^{(R)})^T,   (28)

where Q_k ∈ R^{(n_k r_k) × r_{k−1}} is semi-orthogonal and R_k ∈ R^{r_{k−1} × r_{k−1}} is upper-triangular.
3. Compute the new coefficient tensor

  B_{k−1}(α_{k−2}, i, τ) = Σ_{α_{k−1}=1}^{r_{k−1}} A_{k−1}(α_{k−2}, i, α_{k−1}) R_k(τ, α_{k−1}).   (29)

Furthermore, at index k = 1, the unfolded C_1 along the first coordinate is a row vector C_1^{(R)} ∈ R^{1 × (n_1 r_1)}, whose squared Euclidean norm gives the normalising constant of ĝ(x)², that is, ∫_X ĝ(x)² λ(x) dx = ||C_1^{(R)}||²_2.
Proof See Appendix 8.2.
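The three steps can be transcribed directly; the following sketch (our code, written for a tiny d = 2 example) uses the basis {1, x} on [−1, 1] with unit weight, whose mass matrix is M = [[2, 0], [0, 2/3]]:

```python
import numpy as np

def marginal_coeffs(A_list, M_list):
    """Backward recursion of Proposition 2: B_d = A_d, then for each k
    contract B_k with the Cholesky factor of M_k, unfold, take a thin QR,
    and fold the R factor into A_{k-1}. Returns [B_1, ..., B_d]."""
    B = A_list[-1]                                       # B_d = A_d
    Bs = [B]
    for k in range(len(A_list) - 1, 0, -1):
        L = np.linalg.cholesky(M_list[k])                # M_k = L_k L_k^T
        C = np.einsum("anb,nm->amb", B, L)               # step 1: C_k(a, i, b)
        CR = C.reshape(C.shape[0], -1)                   # step 2: unfold to r_{k-1} x (n_k r_k)
        Q, R = np.linalg.qr(CR.T)                        #         thin QR of the transpose
        B = np.einsum("anb,tb->ant", A_list[k - 1], R)   # step 3: fold R_k into A_{k-1}
        Bs.append(B)
    return Bs[::-1]
```

For rank-1 cores with ĝ(x) = (1 + 2x_1)(3 + 4x_2), the squared norm of the final unfolded C_1 reproduces the exact normalising constant ∫∫ ĝ(x)² dx = (2 + 8/3)(18 + 32/3) = 1204/9, matching the statement at k = 1 in the proposition.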

Proposition 3
The marginal PDF of X̂_1 can be expressed as

  f̂_{X_1}(x_1) = (1/ẑ) π̂_{≤1}(x_1) λ_1(x_1),  with  π̂_{≤1}(x_1) = || B_1(x_1) ||²_2,

where α_0 = 1. For k > 1 and a given x_{<k}, the conditional PDF of X̂_k | X̂_{<k} can be expressed as

  f̂_{X_k|X_{<k}}(x_k | x_{<k}) = || φ_k(x_k)^T D_k ||²_2 λ_k(x_k) / π̂_{<k}(x_{<k}),

where φ_k(x_k) = [φ_k^{(1)}(x_k), ..., φ_k^{(n_k)}(x_k)]^T and D_k ∈ R^{n_k × r_k} is given by

  D_k(i, α_k) = Σ_{α_{k−1}=1}^{r_{k−1}} [G_1(x_1) ··· G_{k−1}(x_{k−1})]^{(α_{k−1})} B_k(α_{k−1}, i, α_k).

Proof The above results directly follow from the definition of the conditional PDF and the marginal functions in (23) and (24).
Note that evaluating the product G_1(x_1) ··· G_{k−1}(x_{k−1}) requires k−1 univariate interpolations and k−2 matrix products per sample, the same operations as in the standard inverse Rosenblatt transport. The QR decompositions (28) and the construction of the coefficient tensors (29) need O(dnr³) operations, but these are pre-processing steps that are independent of the number of samples. However, in contrast to the vector-valued function F_k(x_k) F̄_{k+1} ··· F̄_d ∈ R^{r_{k−1}}, in evaluating the PDF f̂_X we need to multiply the matrix-valued function P_k(x_k) ∈ R^{r_{k−1} × r_k} for each sample. Thus, the leading term of the complexity becomes O(Ndnr²), one order of r or n higher than the complexity of the standard inverse Rosenblatt transport. However, for small r and n this is well compensated by a smoother map, which will be crucial in Section 4.

Implementation of CDFs
To evaluate the SIRT, one has to first construct the marginal CDF of X̂_1 and the conditional CDFs of X̂_k | X̂_{<k} for k > 1, and then invert these CDFs (see (8)). Here we discuss the computation and inversion of the CDFs, which are based on pseudo-spectral methods, for problems with bounded domains, together with extensions to problems with unbounded domains. We refer the reader to [3,57,62] and references therein for more details.

Bounded domain with polynomial basis
For a bounded parameter space X ⊂ R^d, we consider the weighting function λ(x) = 1. Since X can be expressed as a Cartesian product, without loss of generality we discuss here the CDF of a one-dimensional random variable Z with the PDF

  f̂_Z(ζ) ∝ || φ(ζ)^T D ||²_2,   (32)

where φ(ζ) = [φ^{(1)}(ζ), ..., φ^{(n)}(ζ)]^T are the basis functions, D ∈ R^{n × r} is a coefficient matrix, and ζ ∈ [−1, 1]. Here f̂_Z(ζ) can be either the marginal PDF or one of the conditional PDFs defined in Proposition 3, after a suitable linear change of coordinates.
We first consider a polynomial basis, φ^{(i)}(ζ) ∈ P_{n−1} for i = 1, ..., n, where P_{n−1} is the vector space of polynomials of degree at most n−1 defined on [−1, 1]. Thus, the PDF f̂_Z(ζ) can be represented exactly in P_{2n−2}. To enable fast computation of the CDF, we choose the Chebyshev polynomials of the second kind,

  p_m(ζ) = sin((m+1) cos^{−1}(ζ)) / sin(cos^{−1}(ζ)),

as the basis of P_{2n−2}. Using the roots of p_{2n−1}(ζ), we can define the set of collocation points

  ζ_j = cos( jπ / (2n) ),  j = 1, ..., 2n − 1.

This way, by evaluating f̂_Z(ζ) at the collocation points, which needs O(nr) operations, one can apply the collocation method [3, Chapter 4] to represent f̂_Z(ζ) in the Chebyshev basis:

  f̂_Z(ζ) = Σ_{m=0}^{2n−2} a_m p_m(ζ),   (33)

where the coefficients {a_m}_{m=0}^{2n−2} can be computed by the fast Fourier transform with O(n log(n)) operations. Then, using the identity d/dζ t_{m+1}(ζ) = (m+1) p_m(ζ), one can express the CDF of Z as

  F_Z(ζ) ∝ Σ_{m=0}^{2n−2} (a_m / (m+1)) ( t_{m+1}(ζ) − t_{m+1}(−1) ),   (34)

where t_m(ζ) = cos(m cos^{−1}(ζ)) is the Chebyshev polynomial of the first kind of degree m. A random variable Z can be generated by drawing a uniform random variable U and evaluating Z = F_Z^{−1}(U), i.e., solving the root-finding problem F_Z(Z) = U.

Remark 3 The PDF in (32) is positive for all ζ ∈ [−1, 1] by construction and can be represented exactly in P_{2n−2} with the polynomial basis. Thus, its Chebyshev representation in (33) is also positive. This way, the resulting CDF in (34) is monotone, and thus the solution to the inverse CDF equation, F_Z(Z) = U, admits a unique solution.
Remark 4 One can also employ piecewise Lagrange polynomials as a basis to enable hp-adaptivity. With piecewise Lagrange polynomials, the above-mentioned technique can also be used to obtain the piecewise definition of the CDF.
Since F_Z(Z) = U has a unique solution and F_Z is monotone and bounded between [0, 1], root-finding methods such as the regula falsi method and Newton's method usually require only a few iterations to solve F_Z(Z) = U to an accuracy close to machine precision. Overall, the construction of the CDF needs O(nr + n log(n)) operations, and the inversion of the CDF needs O(cn) operations, where O(n) is the cost of evaluating the CDF and c is the number of iterations required by the root-finding method. In comparison, building the matrix D requires O(nr²) operations (cf. Proposition 3).
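The whole pipeline of this subsection can be sketched with NumPy's Chebyshev utilities (our illustration: `make_cdf` and `invert_cdf` are hypothetical names, and bisection stands in for the regula falsi or Newton iterations):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def make_cdf(pdf, deg=40):
    """Fit a positive density on [-1, 1] in a Chebyshev basis, integrate the
    series analytically, and return the normalised CDF."""
    nodes = np.cos(np.pi * (np.arange(deg + 1) + 0.5) / (deg + 1))  # Chebyshev points
    coeffs = C.chebfit(nodes, pdf(nodes), deg)
    icoeffs = C.chebint(coeffs, lbnd=-1)          # antiderivative, zero at -1
    z = C.chebval(1.0, icoeffs)                   # normalising constant
    return lambda x: C.chebval(x, icoeffs) / z

def invert_cdf(cdf, u, tol=1e-12):
    """Invert a monotone CDF on [-1, 1] by bisection."""
    a, b = -1.0, 1.0
    while b - a > tol:
        m = 0.5 * (a + b)
        a, b = (m, b) if cdf(m) < u else (a, m)
    return 0.5 * (a + b)
```

For the positive polynomial density f(ζ) ∝ 1 + ζ², the CDF is exact (the integrand is a polynomial well inside the fitted degree) and the inverse CDF of u = 0.5 recovers the median ζ = 0.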

Bounded domain with Fourier basis
If the Fourier transform of the PDF of Z, i.e., its characteristic function, is band-limited in the frequency domain, then one may choose the sine and cosine Fourier series as the basis for representing the PDF in (32). In this case, the above strategy can also be applied. Recall the Fourier basis with an even cardinality n,

  { 1, sin(mπζ), cos(mπζ), cos(nπζ/2) :  m = 1, ..., n/2 − 1 },

which consists of n/2 − 1 sine functions and n/2 + 1 cosine functions (counting the constant). The PDF f̂_Z(ζ) defined in (32) admits an exact representation in the Fourier basis of cardinality 2n. This way, one can represent f̂_Z(ζ) as

  f̂_Z(ζ) = a_0 + Σ_{m=1}^{n−1} ( a_m cos(mπζ) + b_m sin(mπζ) ) + a_n cos(nπζ),

where the coefficients a_m and b_m are obtained by evaluating f̂_Z(ζ) at the collocation points and applying the rectangular rule. Integrating term by term, this leads to the CDF

  F_Z(ζ) ∝ a_0 (ζ + 1) + Σ_{m=1}^{n−1} [ a_m sin(mπζ)/(mπ) + b_m ( (−1)^m − cos(mπζ) )/(mπ) ] + a_n sin(nπζ)/(nπ).

The construction and the inversion of the CDF using the Fourier basis cost a similar number of operations to the polynomial basis.
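A sketch of the Fourier route (our code; the truncation level and the rectangular-rule coefficients follow the description above, and the CDF is obtained by integrating the series term by term):

```python
import numpy as np

def fourier_cdf(pdf, n=64):
    """Fit a density on [-1, 1] in a truncated Fourier basis via the
    rectangular rule on a uniform grid, and return its closed-form CDF."""
    zeta = -1.0 + 2.0 * np.arange(n) / n               # uniform collocation points
    vals = pdf(zeta)
    a0 = vals.mean()                                   # rectangular-rule coefficients
    m = np.arange(1, n // 2)
    a = 2.0 * (vals[None, :] * np.cos(np.pi * m[:, None] * zeta)).mean(axis=1)
    b = 2.0 * (vals[None, :] * np.sin(np.pi * m[:, None] * zeta)).mean(axis=1)
    def F(x):                                          # term-by-term antiderivative
        s = a0 * (x + 1.0)
        s += np.sum(a * np.sin(np.pi * m * x) / (np.pi * m))
        s += np.sum(b * (((-1.0) ** m) - np.cos(np.pi * m * x)) / (np.pi * m))
        return s
    return F
```

For the raised-cosine density f(ζ) = (1 + cos(πζ))/2, the discrete coefficients are exact, so F(−1) = 0, F(0) = 1/2, and F(1) = 1 up to floating-point roundoff.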

Unbounded domain
Given an unbounded domain, the simplest approach is to truncate the domain at the tail of the PDF.
With the domain truncation, the above-mentioned implementations based on Chebyshev and Fourier basis can be applied directly. Although the function approximation error induced by the domain truncation can be bounded, using the resulting SIRT for computing expectations may lead to a biased estimator.
One can also consider basis functions that are intrinsic to an unbounded domain. For the domain X k = (0, ∞), one can employ the Laguerre polynomials as the basis. This equips X k with a natural exponential weighting function λ k (x k ) = exp(−x k ). The collocation method using higher order Laguerre polynomials can be applied again to obtain the exact representation of the CDF. Similarly, for X k = (−∞, ∞), the Hermite polynomials can be used as a basis, which equips X k with a Gaussian weighting function λ k (x k ) = exp(− 1 2 x 2 k ). Although one can apply the collocation method to obtain an algebraically exact representation of the CDF, the resulting CDF involves error functions, complementary error functions, and imaginary error functions. Those functions have to be approximated numerically. Thus, the computational cost of computing the CDF can be high, and it may be hard to guarantee the monotonicity and uniqueness of the inverse CDF solution at the tails. Using other bases such as the Whittaker cardinal functions for X k = (−∞, ∞) may face a similar challenge.

Remark 5
In a situation where the squared form of the PDF in (32) can be computed but it is challenging to invert the CDF, one can employ rejection sampling [53] to generate random variables. In this situation, our TT approximation can still be used to draw conditional samples. However, this approach may not lead to a deterministic inverse Rosenblatt transport.

Change of coordinate
One can also apply a diffeomorphic mapping to change the coordinates from an unbounded domain X to a bounded one, e.g., Z = [−1, 1]^d, followed by the application of Chebyshev polynomials or Fourier series. Given a PDF f_X(x) of a random variable X ∈ X, suppose we have a diffeomorphic mapping R : X → Z and let q(x) = |∇R(x)| ≥ 0 denote its Jacobian determinant. With the change of coordinates ζ = R(x), one builds a TT to approximate the transformed density f_Z(ζ), constructs the corresponding SIRT to simulate the random variable Z, and applies the inverse mapping X = R^{−1}(Z) to obtain the random variable X. To avoid singularities at the boundary of Z, one can choose a mapping R such that the function q(x) decays more slowly than f_X(x).
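A minimal sketch of such a change of coordinates, assuming the algebraic map R(x) = x/√(1 + x²) from (−∞, ∞) to (−1, 1) (our illustrative choice, not necessarily the paper's): here q(x) = |R′(x)| decays only algebraically, so it decays more slowly than the Gaussian example density, as recommended above.

```python
import numpy as np

R = lambda x: x / np.sqrt(1.0 + x**2)      # diffeomorphism (-inf, inf) -> (-1, 1)
R_inv = lambda z: z / np.sqrt(1.0 - z**2)  # its inverse
q = lambda x: (1.0 + x**2) ** (-1.5)       # q(x) = |R'(x)|, algebraic decay

# Target: a standard normal density on the real line.
f_X = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Transformed density of Z = R(X) on (-1, 1): f_Z(z) = f_X(R^{-1}(z)) / q(R^{-1}(z)).
f_Z = lambda z: f_X(R_inv(z)) / q(R_inv(z))

# Midpoint-rule check that f_Z integrates to one on (-1, 1).
N = 20000
dz = 2.0 / N
zc = -1.0 + dz * (np.arange(N) + 0.5)
mass = float(np.sum(f_Z(zc)) * dz)
```

Because q decays more slowly than f_X, the transformed density f_Z vanishes smoothly at the boundary of (−1, 1), avoiding singularities.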

SIRT error
Since the SIRT enables us to generate i.i.d. samples from the pushforward measure T_♯μ_uni, it can be used to define either Metropolis independence samplers or importance sampling schemes. Based on certain assumptions on the TT approximation g, here we establish error bounds for the TV distance, the Hellinger distance, and the χ²-divergence of the target measure ν_π from T_♯μ_uni. These divergences play a vital role in analysing the convergence of Metropolis–Hastings methods and the efficiency of importance sampling. The error analysis also provides a heuristic for choosing the constant γ in the approximate target density (19).
Proposition 4 Suppose g is a TT approximation to the square root of the unnormalised target density √π, and suppose the error of g and the constant γ satisfy the conditions in (35), respectively. Then the error of the resulting approximate unnormalised density can be bounded accordingly.

Proposition 5 Suppose the conditions in (35) hold. Then the approximate normalising constant ẑ satisfies the corresponding bound.

Proof The normalising constants z and ẑ satisfy an inequality obtained from their definitions. Applying Hölder's inequality (with p = q = 2) and the Minkowski inequality, the right-hand side of this inequality can be bounded further. Since both z and ẑ are positive, substituting the resulting identity and the inequality in (37), the result follows from Proposition 4.
Theorem 1 Suppose the conditions in (35) hold. Then the Hellinger distance between ν_π and T_♯μ_uni satisfies the corresponding bound.

Proof Since the target measure ν_π and the approximate measure T_♯μ_uni have the densities π(x)/z and π̂(x)/ẑ with respect to λ, respectively, the squared Hellinger distance can be bounded accordingly. Applying the identity ‖√π̂‖²_{L²_λ(X)} = ẑ together with Propositions 4 and 5, the bound can be further reduced.

Corollary 1 Suppose the conditions in (35) hold. Then the total variation distance between ν_π and T_♯μ_uni satisfies the corresponding bound.

Proof The result follows directly from the inequality D_TV ≤ √2 D_H and Theorem 1.

Proposition 6
Proposition 6 Let ν_π and ν̂_π̂ be two probability measures, and let h be a function with finite second moments with respect to both ν_π and ν̂_π̂. Then the difference between the expectations of h under the two measures is controlled by the Hellinger distance between them.

Proof Suppose ν_π and ν̂_π̂ have density functions f_X(x) and f̂_X(x), respectively, with respect to the Lebesgue measure. We then have the stated inequality, and the result follows.

Remark 6
The TV distance, Hellinger distance, and χ 2 -divergence of the target measure ν π from the pushforward of the reference µ uni under the SIRTT are linear in the approximation error of the TT.

Deep inverse Rosenblatt transport
In many practical applications, probability densities can be concentrated in a small region of the parameter space or have complicated correlation structures. For example, posterior densities in Bayesian inference problems with informative data often occupy a relatively small region of the parameter space and exhibit complicated nonlinear interactions on some sub-manifold; see [11,12,21,50] for detailed examples. In this situation, a straightforward approximation of a complicated density function in a TT decomposition may require rather large ranks. As the number of function evaluations needed to construct a TT decomposition grows quadratically with the ranks, such direct factorisation of target densities with complicated structures may become infeasible.
Example 1 Consider a d-dimensional multivariate normal distribution with the unnormalised density π(x) = exp(−(1/2) x^⊤ C^{−1} x). If the covariance matrix C ∈ R^{d×d} is diagonal, the joint density factorises into a product of marginal densities, that is, a TT decomposition with all ranks equal to 1. This corresponds to zero-rank off-diagonal blocks C[1:k, (k+1):d], k = 1, ..., d − 1. However, the TT ranks of a correlated normal density π(x) may grow exponentially in the rank of the off-diagonal blocks of C [54].
We design a DIRT framework to construct a composition of order-preserving mappings in the SIRT format that can characterise concentrated probability densities with complicated correlation structures. The construction of DIRT is guided by a sequence of bridging probability measures ν_0, ν_1, ..., ν_L, where ν_L = ν_π is the target measure. Each bridging measure ν_k has a corresponding PDF proportional to an unnormalised density π_k. Here π_0(x) is the unnormalised initial density such that sup_{x∈X} π_0(x) < ∞, π_L(x) = π(x) is the unnormalised target density, and the superscript k indexes the random variable X^k ∈ X, X^k ∼ ν_k. Our goal is to construct a composition of mappings T_0 ∘ T_1 ∘ ··· ∘ T_k such that the pushforward of the reference measure under this composition matches the k-th bridging measure. This way, by gradually increasing the complexity of the geometry and/or the computational cost of the densities of the bridging measures, it becomes computationally feasible to construct the TT and the corresponding SIRT at each layer of the composition.
Assumption 2 Denoting the ratio between two unnormalised densities by r_{k,j}(x) = π_k(x)/π_j(x), we assume that for each pair j < k the ratio r_{k,j}(x) is finite, i.e., uniformly bounded on X.

In practice, there are many ways to choose the bridging measures. For example, one can consider tempered distributions [20,31,40,44,60], where π_k(x) = π(x)^{β_k} for a suitably chosen set of powers (reciprocal temperatures) 0 ≤ β_0 < ··· < β_L = 1; and for problems involving computationally expensive PDE models, one can employ a hierarchy of models with different grid resolutions to reduce the computational cost of building the DIRT. In the rest of this section, we present the recursive construction of DIRT and provide error analysis.
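A tempering schedule of this kind can be sketched as follows, using a toy Gaussian log-density; the geometric schedule β_0 = 10^{−4}, β_{k+1} = √10 · β_k mirrors the one used in the numerical examples of this paper.

```python
import numpy as np

# Geometric tempering schedule capped at beta_L = 1.
betas = [1e-4]
while betas[-1] < 1.0:
    betas.append(min(1.0, np.sqrt(10.0) * betas[-1]))

# Toy unnormalised log target; the bridging densities are pi_k = pi^{beta_k}.
log_pi = lambda x: -0.5 * np.sum(x**2)
log_bridge = lambda x, k: betas[k] * log_pi(x)

# The ratio r_{k+1,k} = pi_{k+1}/pi_k is again a tempered density.
log_ratio = lambda x, k: (betas[k + 1] - betas[k]) * log_pi(x)
```

Since each ratio r_{k+1,k} is itself a (mildly) tempered density, the boundedness required by Assumption 2 holds whenever π is bounded.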

Recursive construction
In the initial step (k = 0), we compute a TT g_0(x) that approximates √π_0 and construct the corresponding SIRT X^0 = T_0(U), so that the reference random variable U ∼ μ_uni is mapped to X^0 ∼ (T_0)_♯μ_uni with a PDF f̂_{X^0}, where the constant γ_0 is chosen such that f̂_{X^0} is an approximation to f_{X^0}(x).

Remark 7
We can replace the uniform reference measure μ_uni with a general product-form probability measure μ that has the PDF f_U(u).

After k ≥ 1 steps, suppose we have the k-th DIRT given as a composition of mappings in which each layer is a SIRT. Denoting the pushforward of the reference probability measure μ under T_k by ν̂_k, i.e., ν̂_k ≡ (T_k)_♯μ, and the density function of ν̂_k by f̂_{X^k}(x), the density function of the pullback probability measure (T_k)^♯ν̂_k is the reference product density f_U(u).
Suppose the approximate PDF f̂_{X^k}(x) can capture the range of variation and the correlation structure of the next PDF f_{X^{k+1}}(x). Then the density function of the pullback probability measure (T_k)^♯ν_{k+1} may become easier to factorise in the TT format than the original target density function f_{X^{k+1}}(x). This way, for step k + 1, the existing composition T_k can be used to precondition the construction of the coupling between μ and ν_{k+1}: we build a coupling between the pullback measure (T_k)^♯ν_{k+1} and the reference measure, and use a SIRT to approximate this coupling T_{k+1}. Using (42), the pullback density in (43) can be expressed as a ratio function. This way, we can compute a TT g_{k+1}(u) to approximate this function, where ω(u) is the weighting function associated with the reference domain U and a ∝ b denotes that a is proportional to b. Since z_k, the normalising constant of f_{X^{k+1}}, is unknown, we only need to factorise an unnormalised version of q_{k+1}(u) into a TT; the normalising constant is computed automatically during the marginalisation process of SIRT (see Proposition 2). The SIRT U^{k+1} = T_{k+1}(U′) built from the TT g_{k+1}(u) couples the uniform reference random variable U′ ∼ μ_uni with U^{k+1} ∼ (T_{k+1})_♯μ_uni. Thus, the composition of transformations U^{k+1} = (T_{k+1} ∘ R)(U) couples the general reference random variable U ∼ μ with U^{k+1} ∼ (T_{k+1} ∘ R)_♯μ, where (T_{k+1} ∘ R)_♯μ is an approximation to the pullback measure (T_k)^♯ν_{k+1}. The next DIRT is therefore defined by appending R and the new incremental SIRT to the current composition T_k. The recursion is completed by obtaining the final composition at level L.
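The recursion can be illustrated in one dimension, where a grid-based CDF stands in for the TT/SIRT machinery. The concentrated toy target, the grid, and the temperatures are illustrative assumptions, and the layer densities are built from the ratio r_{k+1,k} evaluated through the current composition (the approximate ratio approach).

```python
import numpy as np

ugrid = np.linspace(0.0, 1.0, 4001)

def inverse_cdf(q_vals):
    """Monotone inverse CDF on [0, 1] from gridded unnormalised density values."""
    incr = 0.5 * (q_vals[1:] + q_vals[:-1]) * np.diff(ugrid)   # trapezoid rule
    cdf = np.concatenate([[0.0], np.cumsum(incr)])
    cdf /= cdf[-1]
    return lambda u: np.interp(u, cdf, ugrid)

log_pi = lambda x: -0.5 * ((x - 0.3) / 0.05) ** 2   # concentrated toy target
betas = [0.0, 1e-2, 1e-1, 1.0]                      # bridging temperatures

T = lambda u: 2.0 * u - 1.0    # layer 0 pushes uniform(0,1) to uniform(-1,1)
for k in range(len(betas) - 1):
    # ratio r_{k+1,k} evaluated through the current composition
    q = np.exp((betas[k + 1] - betas[k]) * log_pi(T(ugrid)))
    S = inverse_cdf(q)
    T = (lambda Tprev, S: lambda u: Tprev(S(u)))(T, S)

# T now pushes uniform(0,1) approximately onto the target density.
samples = T(np.random.default_rng(1).uniform(size=50000))
```

Note how each layer evaluates the next ratio on the grid transported by the current composition, so the effective resolution automatically concentrates where the target density does.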
Proposition 7 At the j-th DIRT step, the Jacobian of the incremental mapping T_j is given in terms of the density p_{U_j}(u) defined in (46), where the constant γ_j > 0 is chosen according to the L² error of g_j.
Proof The SIRT U_j = T_j(U′), which is constructed by integrating (γ_j + g_j(u)²) ω(u), maps the uniform random variable U′ ∼ μ_uni to U_j ∼ (T_j)_♯μ_uni. Thus, the pushforward measure (T_j)_♯μ_uni has the density function p_{U_j}(u) defined in (46), which yields the result.

Lemma 1 At step k of the DIRT construction, suppose we have the DIRTs constructed as above. Suppose further that we have a normalised density function over the domain X defined in (41) for j = 0, and normalised density functions over the reference domain U defined in (46) for j = 1, ..., k. Then the pushforward measure (T_k)_♯μ has the PDF given in (48).

Proof The result can be shown by induction. For the case k = 0, the result follows directly from (41). Suppose (48) holds for some k ≥ 0. Since R_♯μ = μ_uni, we have the identity (T_k)_♯μ = (T_k)_♯μ_uni for the corresponding composition. At step k + 1, we have the new composition of mappings obtained by appending R and the incremental SIRT, and the pushforward measures (T_{k+1})_♯μ and (T_k)_♯μ have the corresponding density functions. Applying the change of variables through the inverse of the incremental mapping, together with Proposition 7, the result follows.
Corollary 3 At step k of the DIRT construction, the composition of mappings T_k satisfies the analogous pullback identity.

Proof The result follows directly from inverting the composition layer by layer and the proof of Lemma 1.

Remark 8
The normalised PDF of the k-th DIRT step can be expressed in terms of an approximation π̃_k(x) to the unnormalised bridging density π_k(x), where z̃_k = ∏_{j=0}^{k} ẑ_j is an approximation to the normalising constant z_k.

Ratio functions and error analysis
We will first discuss the ratio function (45) and its approximation and then present the corresponding error analysis.

Ratio functions
Given the unnormalised PDF in (51), the pullback density in (44) can be expressed as in (52). This way, we need to compute a TT g_{k+1}(u) to approximate the function q_{k+1}(u) in (52) to build the SIRT T_{k+1}; we call this strategy the exact ratio approach. Alternatively, the pullback density in (44) can be expressed as in (53). Since the DIRT density function π̃_k approximates the k-th unnormalised bridging density function π_k, the pullback density in (53) can be approximated by (54). This way, we need to compute a TT g_{k+1}(u) that approximates the function q̃_{k+1}(u) in (54) to build an alternative SIRT T_{k+1}; we call this strategy the approximate ratio approach.
Remark 9 For all k ≥ 0, we want the ratio π_k/π̃_k to be finite on U. Otherwise, it may cause large errors in the TT decomposition and may deteriorate the convergence of the resulting sampling schemes for characterising ν_π. Given Assumption 2 and γ_j > 0 for all j = 0, ..., k, it can be shown (using induction) that the ratio π_k/π̃_k is bounded.

Remark 10
In some situations, the ratio function (r_{k+1,k} ∘ T_k)(u) may exhibit sharp boundary layers if the uniform reference measure μ_uni (with R = I) is used. This can increase the complexity of the resulting TT decompositions. Apart from carefully choosing the bridging measures, a partial remedy for the boundary layers is to use a reference measure with a density f_U(u) decaying towards the boundary, such as the normal density truncated on a sufficiently large hypercube [−σ, σ]^d. The function f_U(u) in (52) and (54) smooths the previous approximation errors, which can improve the accuracy of the TT approximations. With a reference measure defined on a hypercube, the collocation techniques based on Chebyshev and Fourier bases (see Section 3.2) can be applied to construct and evaluate functional TT decompositions in DIRT.

DIRT error
Based on assumptions on the TT error at each layer of the DIRT construction, here we first establish bounds on the approximation errors of the DIRT-induced approximate density π̃_k for both the exact ratio approach and the approximate ratio approach. The error analysis also provides heuristics for choosing the constants γ_j. The error bounds on π̃_k then lead to bounds on the TV distance, the Hellinger distance, and the χ²-divergence of the k-th bridging measure ν_k from the pushforward measure (T_k)_♯μ.
Theorem 3 (Exact ratio approach) At the k-th DIRT step (k > 0), suppose the TT decomposition g_k ≈ q_k, with error ε_k, and the constant γ_k satisfy the corresponding conditions, where q_k is defined in (52). Then the unnormalised PDF π̃_k in (51) approximates the unnormalised density function of the k-th bridging measure π_k with the error ‖√π_k − √π̃_k‖_{L²_λ(X)} ≤ 2 z̃_{k−1} ε_k, where z̃_{k−1} = ∏_{j=0}^{k−1} ẑ_j is the normalising constant of the unnormalised PDF π̃_{k−1}.
Proof We start with an identity for the difference √π_k − √π̃_k. Applying the change of variables u = T_{k−1}^{−1}(x) and Equations (51) and (52), the difference can be bounded, where the second-to-last inequality follows from the same argument as in the proof of Proposition 4.
Theorem 4 (Approximate ratio approach) Suppose the initial TT decomposition g_0 ≈ √π_0 and the constant γ_0 satisfy the corresponding conditions, respectively. At the k-th DIRT step (k > 0), suppose further that the TT decompositions g_j ≈ q̃_j and the constants γ_j satisfy the corresponding conditions, respectively, where q̃_j is defined in (54). Then the unnormalised PDF of DIRT defined by (51) approximates the k-th unnormalised bridging density function with an error controlled by the layer-wise TT errors, where c_{k,j} = sup_{x∈X} r_{k,j}(x) is given in Assumption 2 and z̃_k = ∏_{j=0}^{k} ẑ_j.
Proof The difference between √π_k and √π̃_k can be written as a telescoping sum over the layers. Recalling that c_{k,j} = sup_{x∈X} r_{k,j}(x) and applying Proposition 4, each term can be bounded. Applying the change of variables u = T_{j−1}^{−1}(x) and Corollary 3 for each j > 0, the result follows.
Remark 11 At first glance, it appears that Theorem 3 gives smaller errors than Theorem 4. However, this assumes that the two ratio functions in (52) and (54) are approximated with the same TT error ε_k, ideally with the same number of degrees of freedom in the TT cores. In practice this may not be the case: the exact ratio (53) carries the previous approximation errors through the term (π_k ∘ T_k)(u)/(π̃_k ∘ T_k)(u), which can have a complicated structure that is difficult to approximate in TT. In contrast, the approximate ratio involves only the target densities. For example, if the bridging densities π_k are introduced by tempering, the ratio r_{k+1,k} = π^{β_{k+1}−β_k} is just another tempered density. For this reason, DIRT built by the approximate ratio approach may be more accurate in practice.

Corollary 4
Given π_k and π̃_k constructed using either the exact or the approximate ratio functions, suppose the error satisfies ‖√π_k − √π̃_k‖_{L²_λ(X)} ≤ e_k. Then the Hellinger distance and the total variation distance between the k-th bridging measure ν_k and the pushforward measure (T_k)_♯μ satisfy the corresponding bounds.

Proof The results can be obtained by applying the same proofs as in Theorem 1, Corollary 1, and Corollary 2, respectively. Note that for the χ²-divergence, the condition sup_{x∈X} π_k(x)/π̃_k(x) < ∞ holds, as discussed in Remark 9.

Debiasing
Applying either SIRT or DIRT, one obtains an approximate map T : U → X that enables the simulation of a random variable X̃ ∼ T_♯μ approximating the target random variable X ∼ ν_π. In a situation where SIRT and DIRT are highly accurate approximations of the target measure, one can approximate the expectation ν_π(h) of a function of interest h(x) directly by the expectation of h(x) over T_♯μ, i.e., (T_♯μ)(h) ≡ μ(h ∘ T). The bias of the approximated expectation is proportional to the Hellinger distance D_H(ν_π, T_♯μ) (see Proposition 6). In addition, we can apply the approximate inverse Rosenblatt transport T within the Metropolis–Hastings method and importance sampling to reduce the bias in computing ν_π(h). For the sake of completeness, here we discuss some debiasing strategies based on existing work. In Algorithm 1, given the previous state of the Markov chain X^{(j−1)} = x, the proposal X̃ is accepted with the probability in (57), setting X^{(j)} = X̃; otherwise, X^{(j)} = x.
We first consider IRT-MCMC (Algorithm 1), in which the approximate IRT is used as the proposal mechanism in a Metropolised independence sampler for constructing a Markov chain of random variables that converges to the target measure. In the acceptance probability (57), f_X(·) is the PDF of the target measure ν_π and f̂_X(·) is the PDF of T_♯μ defined by either the SIRT (19) or the DIRT (50). Following the result of Mengersen and Tweedie [41], the bounds discussed in Remarks 2 and 9 guarantee the uniform ergodicity of the Markov chain constructed by Algorithm 1. In addition, the average rejection probability is bounded by 2 D_TV(ν_π, T_♯μ); see Lemma 1 of [17]. This provides an indicator of the performance of the Metropolised independence sampler. However, our bound on D_TV(ν_π, T_♯μ) does not directly yield a bound on the convergence rate of the Metropolised independence sampler, for which the use of acceptance/rejection may require more precise control of the pointwise error, e.g., ‖g − √π‖_{L^∞_λ(X)}, to assess the convergence rate of the sampler.

One can also employ the approximate IRT built by either the SIRT or the DIRT as the biasing distribution in importance sampling, which leads to the IRT-IS algorithm (Algorithm 2). Compared with IRT-MCMC, IRT-IS generates random variables X̃^{(j)} from the approximate IRT and corrects the bias using the weights W_j. By avoiding the Markov chain, importance sampling offers several advantages over the Metropolised independence sampler: (i) it can be easily parallelised; and (ii) variance reduction techniques such as antithetic variables and control variates (see [53, Chapter 4] and references therein) and efficient high-dimensional quadrature methods such as quasi-Monte Carlo [16] can be naturally applied within importance sampling.
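In the spirit of Algorithm 1, a Metropolised independence sampler can be sketched as follows; the Gaussian target and the deliberately imperfect Gaussian approximation are toy stand-ins for f_X and f̂_X.

```python
import numpy as np

rng = np.random.default_rng(0)

log_f = lambda x: -0.5 * x**2                  # toy unnormalised target, N(0, 1)
log_fhat = lambda x: -0.5 * (x / 1.2) ** 2     # imperfect approximation, N(0, 1.44)
draw_prop = lambda: 1.2 * rng.standard_normal()  # sample from the approximation

n, x = 20000, 0.0
chain, accepts = np.empty(n), 0
for j in range(n):
    xp = draw_prop()
    # independence-sampler acceptance ratio: [f(x') fhat(x)] / [f(x) fhat(x')]
    log_alpha = (log_f(xp) + log_fhat(x)) - (log_f(x) + log_fhat(xp))
    if np.log(rng.uniform()) < log_alpha:
        x, accepts = xp, accepts + 1
    chain[j] = x
```

Since the weight f/f̂ is bounded for this pair of densities, the chain is uniformly ergodic, consistent with the Mengersen–Tweedie condition cited above.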
The error bounds established in Sections 3 and 4 offer insights into the efficiency of IRT-IS. As discussed in [48, Chapter 9], for N approximate target random variables, one can use the effective sample size (ESS) to assess the efficiency.

Algorithm 2 IRT-IS
1: for j = 1, 2, ..., N do
2: Draw U^{(j)} ∼ μ and compute the approximate target random variable X̃^{(j)} = T(U^{(j)}).
Lemma 2 Given the χ²-divergence of ν_π from T_♯μ, the ESS of Algorithm 2 satisfies the corresponding bound.

Proof Since (T_♯μ)(πλ/f̂_X) = z, where z is the normalising constant of the target density, the χ²-divergence of ν_π from T_♯μ can be expressed in terms of the importance weights, in which one can choose w to be the ratio πλ/f̂_X multiplied by any nonzero constant. Thus, the result follows.
The bounds discussed in Remarks 2 and 9 imply that the estimators ẑ_N and h̄_N computed by Algorithm 2 are unbiased for the normalising constant z and the expectation (T_♯μ)(w h), respectively. However, the ratio estimator I_N is only asymptotically unbiased, i.e., P(lim_{N→∞} I_N = ν_π(h)) = 1.
For a finite sample size, N < ∞, the ratio estimator I_N is a biased estimator of ν_π(h). However, for a sufficiently large sample size N, one can apply the delta method (see [48, Chapter 2] and references therein) to show that the mean square error (MSE) of I_N admits an approximation controlled by the χ²-divergence. Thus, for a sufficiently regular function h, the MSE of the ratio estimator I_N can also be controlled by D_χ²(ν_π, T_♯μ).
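Self-normalised importance sampling with the usual ESS estimator, ESS(N) = N (mean w)² / mean(w²), can be sketched as follows; the Gaussian target, the Gaussian proposal, and the function h are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

log_f = lambda x: -0.5 * x**2   # unnormalised target, N(0, 1) with z = sqrt(2*pi)
log_fhat = lambda x: -0.5 * (x / 1.2) ** 2 - np.log(1.2 * np.sqrt(2.0 * np.pi))
h = lambda x: x**2              # function of interest

N = 100000
xs = 1.2 * rng.standard_normal(N)         # draws from the (normalised) proposal
w = np.exp(log_f(xs) - log_fhat(xs))      # unnormalised importance weights
I_N = np.sum(w * h(xs)) / np.sum(w)       # self-normalised (ratio) estimator
ess = N * w.mean() ** 2 / (w**2).mean()   # effective sample size
```

Here the sample mean of the weights is an unbiased estimator of the normalising constant z, while the ratio estimator I_N is only asymptotically unbiased, as discussed above.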

Numerical examples
We demonstrate the efficiency and various aspects of DIRT, which employs SIRT within each layer, on the following numerical examples.

Predator and prey
The predator-prey model is a system of coupled ODEs frequently used to describe the dynamics of biological systems. The populations of the predator (denoted by Q) and the prey (denoted by P) change over time according to a pair of ODEs with initial conditions P(t = 0) = P_0 and Q(t = 0) = Q_0. The dynamical system is controlled by several parameters. In the absence of the predator, the population of the prey evolves according to the logistic equation characterised by r and K. In the absence of the prey, the population of the predator decreases exponentially at a rate v. In addition, the two populations have a nonlinear interaction characterised by α, s, and u. Since the initial populations and the parameters r, K, α, s, u, and v are often unknown, we need to estimate these unknowns from the populations of the predator and prey observed at time instances t_i for i = 1, ..., n_T.
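One plausible Holling-type parametrisation consistent with the description above (logistic prey growth, exponential predator decay at rate v, and a saturating interaction involving α, s, and u) can be integrated as follows; the functional form and the parameter values are our illustrative assumptions, not necessarily the paper's exact equations.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y, r, K, alpha, s, u, v):
    """Illustrative predator-prey dynamics; alpha controls interaction saturation."""
    P, Q = y
    inter = P * Q / (alpha + P)              # saturating interaction term
    return [r * P * (1.0 - P / K) - s * inter,   # logistic prey minus predation
            u * inter - v * Q]                   # predator growth minus decay

# Hypothetical parameter values and initial populations, for illustration only.
theta = dict(r=0.6, K=100.0, alpha=30.0, s=1.2, u=0.5, v=0.3)
sol = solve_ivp(rhs, (0.0, 50.0), [50.0, 5.0], args=tuple(theta.values()),
                t_eval=np.linspace(0.0, 50.0, 26), rtol=1e-8, atol=1e-10)
```

In the inference setting described next, the solver output at the observation times plays the role of the forward model G.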

Posterior density
Let y ∈ R^{2n_T} denote the observed populations of the predator and prey. We define a forward model G that maps a given parameter x to the populations of the predator and prey computed at {t_i}_{i=1}^{n_T}. Assuming independent and identically distributed (i.i.d.) normal noise in the observed data and assigning a prior density π_0(x) to the unknown parameter, one can define the unnormalised posterior density. To illustrate the behaviour of the posterior density, we plot the kernel density estimates of the marginal posterior densities in Figure 2. Note that some of the parameters are significantly correlated, which makes the posterior density function difficult to explore both by MCMC and by a straightforward TT approximation.

Numerical results
We use L = 8 bridging measures in the construction of DIRT by tempering the unnormalised posterior density with π_k(x) = π(x)^{β_k}, starting from β_0 = 10^{−4} and followed by β_{k+1} = √10 · β_k. This way, β_L = 1 gives the target probability density. We consider two reference measures: the uniform reference measure μ_uni and the truncated normal reference measure μ_TG. Note that at layer 0, the ratio function is just the tempered density π_0(x) in the original domain x_k ∈ [a_k, b_k]. We employ piecewise-linear basis functions with n equally spaced interior collocation points for both reference measures. In addition, we tune TT-cross (Algorithm 4) using three parameters: the initial TT rank R0, the enrichment TT ranks ρ_1 = ··· = ρ_{d−1} = Rho, and the maximum number of TT-cross iterations MaxIt. These uniquely define the maximum TT rank R_max = R0 + Rho · MaxIt.
Firstly, we vary one tuning variable at a time and investigate its impact on the efficiency and computational cost of the DIRT. We take the number of posterior density evaluations required by TT-cross in each DIRT layer to measure the computational cost of building the DIRT. In Figure 3, we vary the enrichment rank Rho and the number of TT-cross iterations MaxIt. The initial TT rank R0 is adjusted such that the maximum TT rank is 13 in all cases. We set the number of collocation points to n = 16. All the DIRTs are constructed using the approximate ratio (54). The results obtained with the uniform reference measure are reported in Figure 3 (c). In contrast, using the truncated normal reference measure (Figure 3 (b) and (e)) can significantly improve the efficiency in this example. With only one TT-cross iteration, it can reduce the final IACT to below 4 and N/ESS(N) to below 3.

Remark 12
At levels k > 0 of the DIRT construction, the ratio functions may have a similar shape (see Figure 1). Thus, one can take the TT of the ratio function at the previous level k as the initial guess for building the TT at level k + 1. This initialisation provides good index sets in TT-cross, such that only one TT-cross iteration is sufficient with the truncated normal reference measure.
In Figure 4, we vary the maximum TT rank R_max. With the uniform reference, we set Rho = 3 and MaxIt = 3. With the truncated normal reference, we set Rho = 0 and MaxIt = 1, which makes the number of density evaluations equal to the number of degrees of freedom in the TT decomposition, n_1 r_1 + Σ_{k=2}^{d−1} n_k r_{k−1} r_k + r_{d−1} n_d = (d − 2) n R²_max + 2 n R_max. We observe that the two reference measures eventually give comparable IACTs and ESSs as R_max increases. However, the truncated normal reference achieves this with far fewer density evaluations.
In Figure 4, we also compare the approximate ratio (54), used in all experiments, with the exact ratio (52). The diamond-shaped markers in Figure 4 (a) and (b) show the IACTs and ESSs obtained by the exact ratio approach. In this example, it gives worse results, with a larger IACT and N/ESS(N) for the truncated normal reference measure at lower R_max values, and it does not lead to any meaningful results for the uniform reference measure.
In Figure 5, we vary the number of collocation points n used in each dimension. The truncated normal reference starts with a larger error since n = 10 points cannot resolve the rather large reference domain [−4, 4]. With increasing n, the IACT obtained using the truncated normal reference decays rapidly. In comparison, the IACT obtained using the uniform reference exhibits a spike and does not show rapid decay with increasing n. This may be caused by the boundary layers in the ratio function. Similar trends are observed in the reported N/ESS(N). Again, the truncated normal reference requires significantly fewer density evaluations to achieve the same level of accuracy compared to the uniform reference in this experiment.
DRAM is initialised with the covariance matrix 5I, an adaptation scale of 2.4/√d, an adaptation interval of 10, and a delayed rejection scale of 2. These parameter values are commonly recommended defaults.
For this example, SVN is sensitive to the choice of the step size and to the initial distribution of particles. We choose the step size to be 2 · 10 −2 and generate the initial particle set from the normal distribution N (x true , (2 · 10 −2 x true ) 2 ), which gives a reasonable balance between the stability and the rate of convergence. We carry out 23 Newton iterations in SVN to approach stationarity.
HINT is an autoregressive normalising flow estimator for the joint probability density π(y, x). Since both the prior random variable X with the density π_0(x) and the noise random variable η can be directly simulated, drawing samples from the joint distribution is easy: one first draws a sample X from the prior and then simulates the corresponding data sample Y = G(X) + η by generating a noise random variable η. Drawing a set of i.i.d. samples (Y^{(i)}, X^{(i)}), i = 1, ..., N, from the joint measure, HINT computes a triangular invertible map (u_Y, u_X) = S_θ(y, x) from the (joint) target measure to the reference measure by maximising the likelihood over the parameter θ that defines the neural networks S_θ. Given observed data y, this allows one to define a conditional map x = T_θ(u_X; y) that maps from the reference measure of U_X to the posterior measure conditioned on the data y.

We simulate each method M = 10 times with N samples produced in each simulation, denoted by {x^{(ℓ,j)}}_{j=1}^{N}, where ℓ = 1, ..., M indexes the simulations. For each simulation, we compute the empirical posterior covariance matrix about the empirical posterior mean. Then, we use the average deviation of the covariance matrices to benchmark the performance of the different sampling algorithms. Here we employ the Förstner–Moonen distance [19] over the cone of symmetric positive definite (SPD) matrices, where λ_i(A, B) denotes the i-th generalised eigenvalue of the pair of SPD matrices (A, B), to measure the deviation. Averaging the Förstner–Moonen distance between the ℓ-th empirical covariance matrix and the average covariance matrix over all M simulations provides an estimated deviation of the empirical covariance matrices computed by a given algorithm.
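The Förstner–Moonen distance can be computed directly from the generalised eigenvalues, d(A, B) = (Σ_i ln² λ_i(A, B))^{1/2}; the matrices below are toy examples.

```python
import numpy as np
from scipy.linalg import eigh

def forstner_moonen(A, B):
    """Foerstner-Moonen distance between SPD matrices A and B."""
    lam = eigh(A, B, eigvals_only=True)   # generalised eigenvalues of (A, B)
    return np.sqrt(np.sum(np.log(lam) ** 2))

A = np.diag([1.0, 2.0, 4.0])
B = np.eye(3)
dist = forstner_moonen(A, B)
```

Since the generalised eigenvalues of (B, A) are the reciprocals of those of (A, B), the distance is symmetric in its arguments.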
In Figure 6, we plot the covariance deviations (60) obtained by IRT-MCMC, DRAM, SVN and HINT versus the total number of target density evaluations and the total CPU time needed by each algorithm. The reported totals include the construction of DIRT in each simulation experiment. The 10 independent simulations are run in parallel on a workstation with an Intel Xeon E5-2640v4 CPU at 2.4 GHz. We observe that DIRT produces estimated covariance matrices with the smallest deviations in almost all tests. Moreover, DIRT is computationally more efficient in terms of CPU time, because the evaluation of DIRT can take advantage of vector instructions.
In this example, HINT gives the worst results, since joint density estimation from samples is a much higher-dimensional problem than posterior density approximation. In particular, the dimension of y in the predator-prey model is 2n_T = 26, so the total dimensionality of the problem increases to 34. This required us to construct HINT networks with 8 blocks containing 200 × 200 weights each, totalling 2 691 308 trainable parameters in the entire HINT. We trained the networks using 5 000 000 training samples for 50 epochs, consisting of 500 batches of 10 000 samples each. The other (e.g. ADAM) parameters are left unchanged from [35]. The training took 5.8 hours on an NVidia GeForce GTX 1650 Max-Q GPU card. To avoid disproportionate scaling of the axes in Figure 6 compared to the other methods, we show only an indicative marker for HINT. The actual error (60) with 100 000 test samples taken directly from HINT was 16.7. Using the test samples as proposals in MCMC rejection against the exact posterior gives a rejection rate of 97% and an IACT of 127, and the covariance matrix computed from the rejected samples gives a Förstner–Moonen distance of 0.1. This indicates that data-driven joint density estimation should be more applicable to lower-dimensional data, whereas if one is only interested in the posterior, function approximation methods seem to be a better choice.

Lorenz-96
This is a widely used benchmark model of atmospheric circulation. We consider a Lorenz-96 model specified by a system of ODEs with periodic boundary conditions and an unknown initial condition P_i(0) = x_i for i = 1, ..., d. The state dimension is set to d = 40. Observing noisy states with even indices at the final time T = 0.1, we aim to infer the initial state x in this example. This way, we have observed data y ∈ R^{20}. Assuming i.i.d. normal noise in the observed data and assigning a truncated normal prior density to the initial condition, we obtain the unnormalised posterior density. We use a synthetic data set y = G(x_true) + η, where x_true is drawn from N(1, 10^{−4} I_d) and η is a realisation of the i.i.d. zero-mean normal noise with standard deviation σ = 10^{−1}.
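The standard Lorenz-96 right-hand side can be sketched as follows; the forcing value F = 8 is the usual chaotic choice and is an assumption here, as the text above does not state the paper's forcing term.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz96(t, x, F=8.0):
    """Standard Lorenz-96: dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

d = 40
# Perturbed initial state near 1, mimicking the prior's concentration.
x0 = np.ones(d) + 1e-2 * np.random.default_rng(0).standard_normal(d)
sol = solve_ivp(lorenz96, (0.0, 0.1), x0, rtol=1e-8, atol=1e-10)
xT = sol.y[:, -1]      # terminal state at T = 0.1, as observed in the example
```

The np.roll calls implement the periodic boundary conditions; the states with even indices of xT, perturbed by noise, play the role of the observed data.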
For the TT-cross approximations, we use the truncated normal reference measure on [−3, 3]^d, piecewise linear basis functions with n = 15 interior collocation points, MaxIt = 1 TT-cross iteration, and TT ranks R_max = 15. DIRT is built with the tempered density with β_0 = 10^{−4} and β_{k+1} = √10 · β_k. This way, we need L = 8 layers to reach the posterior density. A weaker tempering of the prior is used to reduce its impact on the intermediate levels. This allows most of the intermediate DIRT levels to be used to bridge the more complicated likelihood. This setup requires a total of 1.2 × 10^6 density evaluations in TT-cross over all layers, and provides an average ESS of N/1.55 in IRT-IS and an average IACT of 2.6 in IRT-MCMC.
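The geometric tempering ladder described above can be written down directly; a minimal sketch:

```python
# Tempering ladder beta_k = 10^{-4} * (sqrt(10))^k, i.e. beta_{k+1} = sqrt(10) * beta_k,
# which reaches beta_L = 1 after L = 8 layers. Each beta_k defines one bridging
# density (and hence one DIRT layer) between the reference and the posterior.
betas = [10.0 ** (-4 + 0.5 * k) for k in range(9)]
print(len(betas) - 1, betas[0], betas[-1])  # 8 layers, from 1e-4 to 1.0
```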
Using the posterior density, we can quantify the uncertainty of the inferred initial state and make predictions of the terminal state. The predicted initial state is shown in Figure 7. Note that the chaotic regime of Lorenz-96 makes it difficult to predict the unobserved odd coordinates. Nevertheless, DIRT demonstrates high numerical and sampling efficiency in approximating this complicated posterior.

Elliptic PDE
In the third example, we apply both SIRT and DIRT to the classical inverse problem governed by the stochastic diffusion equation with Dirichlet boundary conditions u|_{s_1=0} = 1 and u|_{s_1=1} = 0 on the left and right boundaries, and homogeneous Neumann conditions on the other boundaries. The goal is to infer the unknown diffusion coefficient κ_d(s; x) from incomplete observations of the potential function u(s). Here we adopt the same setup used in [17,56].

Posterior density
The unknown diffusion coefficient κ_d(s; x) is parametrised by a d-dimensional random variable x. We take each of the parameters x_k, k = 1, ..., d, to be uniformly distributed on [−√3, √3]. Then, for any x ∈ [−√3, √3]^d and s = (s_1, s_2) ∈ D, the logarithm of the diffusion coefficient at s is defined by the following expansion, with the index mapping τ(k) = ⌈½(√(1 + 8k) − 1)⌉. To discretise the PDE in (62), we tessellate the spatial domain D with a uniform Cartesian grid of mesh size h. Then, we replace the infinite-dimensional solution u ∈ V ≡ H^1(D) by the continuous, piecewise bilinear finite element (FE) approximation u_h ∈ V_h associated with the discretisation grid. To find u_h, we solve the resulting Galerkin system using a sparse direct solver. A fixed discretisation with d = 11, h = 2^{−6}, and ν = 2 is used in this example.
The observed data y ∈ R^m consist of m local averages of the potential function u(s) over subdomains D_i ⊂ D, i = 1, ..., m. To simulate the observable model outputs, we define the forward model accordingly. The subdomains D_i are squares with side length 2/(√m + 1), centred at the interior vertices of a uniform Cartesian grid on D = [0, 1]^2 with grid size 1/(√m + 1), which form an overlapping partition of D. Synthetic data for these m local averages are produced from the "true" parameter x_true = (1.5, ..., 1.5) by adding i.i.d. zero-mean normally distributed noise with standard deviation σ. This way, we have the unnormalised posterior density
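The structure of this unnormalised posterior (a Gaussian likelihood on m observations and a uniform prior on [−√3, √3]^d) can be sketched as follows. The linear map A is a hypothetical stand-in for the actual FE forward model, which would require the PDE solver; everything else mirrors the setup above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, sigma = 11, 9, 0.1

# Placeholder forward model: a fixed linear map standing in for the
# FE solve plus local averaging (the real G requires the PDE solver).
A = rng.standard_normal((m, d)) / np.sqrt(d)
G = lambda x: A @ x

x_true = np.full(d, 1.5)                    # inside [-sqrt(3), sqrt(3)]^d
y = G(x_true) + sigma * rng.standard_normal(m)

def log_posterior_unnormalised(x):
    # Uniform prior on [-sqrt(3), sqrt(3)]^d and Gaussian likelihood.
    if np.any(np.abs(x) > np.sqrt(3.0)):
        return -np.inf
    r = G(x) - y
    return -0.5 * np.dot(r, r) / sigma**2

print(log_posterior_unnormalised(np.zeros(d)))
```

This pointwise-evaluable log-density is exactly the kind of input that TT-cross consumes when building SIRT/DIRT layers.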

Numerical results
In this example, we compare the impact of different tempering schemes, different numbers of measurements, and different measurement noise levels on DIRT. We also compare different basis functions used in the DIRT construction. In all experiments, we feed N = 2^16 independent samples generated by DIRT to both IRT-MCMC and IRT-IS.
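The diagnostics reported below (N/ESS(N) from importance weights and the IACT of a chain) can be computed with standard estimators; the following is a minimal sketch, where the particular IACT truncation rule (initial positive sequence) is our assumption, not necessarily the one used in the paper.

```python
import numpy as np

def ess_from_weights(logw):
    # Effective sample size of self-normalised importance weights,
    # ESS = (sum w)^2 / sum w^2, computed stably in log space.
    logw = np.asarray(logw, dtype=float)
    w = np.exp(logw - np.max(logw))
    return w.sum() ** 2 / (w * w).sum()

def iact(chain):
    # Integrated autocorrelation time tau = 1 + 2 * sum_k rho_k, truncated
    # at the first non-positive autocorrelation (initial positive sequence).
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        tau += 2.0 * rho[k]
    return tau

rng = np.random.default_rng(2)
ess = ess_from_weights(np.zeros(1000))   # equal weights give ESS = N
tau = iact(rng.standard_normal(5000))    # an i.i.d. chain has tau close to 1
print(ess, tau)
```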
In Figure 8, we compare DIRT with three different tempering sequences β = [β_0, ..., β_L], varying the grid size n and the TT ranks R_max. Note that with L = 0 we have the single-layer SIRT. The reported number of density function evaluations is the sum of the numbers of evaluations in TT-cross over all layers. With the noise variance σ² = 10^{−2} and a rather small data size m = 3², the posterior density is relatively simple to characterise, and hence can be tackled directly using the single-layer SIRT (see the case L = 0 in Figure 8 and [17]). However, the multilayer DIRT uses a much smaller number of collocation points and TT ranks to produce an approximate posterior density with the same accuracy. Here the 3-layer DIRT needs only 10% of the density evaluations required by the single-layer counterpart.

Next, we test the multilayer DIRT on more difficult posterior densities, with larger numbers of measurements and smaller observation noise. We set the number of collocation points to n = 16, the maximum number of TT-cross iterations to MaxIt = 1, and the maximum TT rank to R_max = 12. In Figure 9, we fix σ² = 10^{−2} and vary the number of measurements. Since halving the measurement grid size 2/(√m + 1) corresponds to multiplying m by approximately a factor of 4, we use a different tempering strategy, starting with β_0 = 4^{−log_4 m} and setting β_{k+1} = 4 · β_k for the next layers. This way, the number of layers grows proportionally to log m, and the number of density evaluations in TT-cross for fixed TT ranks is also proportional to log m, which is confirmed by Figure 9 (a). Here we can see that the Fourier basis is significantly more accurate than the piecewise linear basis for the same grid size. With the linear basis, both the IACT and N/ESS(N) grow logarithmically in the number of measurements.
With the Fourier basis, the IACT stays almost constant below 1.5 and N/ESS(N) stays almost constant below 1.1, increasing slightly only in the most difficult case with m = 31² (Figure 9 (b) and (c)). With an increasing number of measurements, the likelihood becomes more concentrated. This makes it more challenging to characterise the posterior using prior-based approaches such as QMC [56] or the single-layer TT approximation. For example, even with a much larger number of collocation points n = 65 and 5 iterations of TT-cross (giving a maximal TT rank of 41), we still cannot produce reasonable results for m = 15² with the single-layer SIRT.
We carry out an additional test with decreasing noise variance σ². In Figure 10, we fix m = 15² and vary σ² from 10^{−1} to 10^{−5}. In this experiment, fixing the TT ranks becomes insufficient for representing posterior densities with low observation noise. In particular, the piecewise linear basis does not have sufficient accuracy for the case of the smallest noise variance. In contrast, the Fourier basis can still retain low IACT and N/ESS(N) in the low-noise cases, where the IACT and N/ESS(N) grow proportionally to log(1/σ). Together with the log-scaling of the number of evaluations, the effective complexity of the entire IRT-MCMC and IRT-IS schemes becomes poly-logarithmic in the variance. Although the Fourier basis is computationally more expensive to evaluate than the piecewise linear basis, by a factor of 2.5 in the worst case in this experiment, this additional effort is well compensated by the much higher accuracy. This makes DIRT a viable approach for a range of concentrated distributions.

Parabolic PDE
In the fourth example, we consider the inverse problem of identifying the diffusion coefficient of a two-dimensional parabolic PDE from point observations of its solution. On the domain D = [0, 3] × [0, 1], with boundary ∂D, we model the time-varying potential function p(s, t) for a given diffusion coefficient field κ_d(s) and forcing function f(s, t) using the heat equation, where T = 10. Parabolic PDEs of this type are widely used in modelling groundwater flow, optical diffusion tomography, the diffusion of thermal energy, and numerous other common scenarios for inverse problems. Let ∂D_n = {s ∈ ∂D | s_2 = 0} ∪ {s ∈ ∂D | s_2 = 1} denote the top and bottom boundaries, and ∂D_d = {s ∈ ∂D | s_1 = 0} ∪ {s ∈ ∂D | s_1 = 3} denote the left and right boundaries. For t ≥ 0, we impose the mixed boundary condition, where n(s) is the outward normal vector on the boundary. We also impose a zero initial condition, i.e., p(s, 0) = 0 for all s ∈ D, and let the potential field be driven by a time-invariant forcing function with r = 0.05, which is the superposition of two normal-shaped sink/source terms centred at a = (0.5, 0.5) and b = (2.5, 0.5), scaled by a constant c = 5π × 10^{−5}.

Posterior density
The logarithm of the diffusion coefficient, ln κ_d(s; x), is endowed with the process convolution prior [29], where d = 27, ln κ̄ = −5, each coefficient x_k follows a standard normal prior N(0, 1) (which can be truncated to [−5, 5] with sufficient accuracy), and s^{(k)}, k = 1, ..., d, are the centres of the kernel functions (shown as blue crosses in Figure 11 (a)). Similarly to the previous example, the potential function p(s, t) in (64) is approximated by p_h(s, t) using the finite element method with piecewise bilinear basis functions and implicit Euler time integration. The observed data y ∈ R^{m×n_T} consist of the time-varying potential function p(s, t) measured at m = 13 locations (shown as black dots in Figure 11 (b) and (c)) at n_T = 10 discrete time points equally spaced between t = 1 and t = 10. To simulate the observable model outputs, we define the forward model G_h : X → R^{m×n_T} with G_h^{i,j}(x) = p_h(s_i, t_j; x), i = 1, ..., m, j = 1, ..., n_T.
Using a "true" parameter x_true drawn from the prior distribution and the forward model with h = 1/80, synthetic data y ∈ R^{m×n_T} are produced by adding i.i.d. zero-mean normal noise with standard deviation σ = 1.65 × 10^{−2} to G_h(x_true), which corresponds to a signal-to-noise ratio of 10. The corresponding ln κ_d(s; x_true) and the simulated potential function at several time snapshots are shown in Figure 11. This way, we have the unnormalised posterior density

Numerical results
To construct DIRT, we employ a geometric grading in β, refining towards 1. The posterior is very concentrated in this example, so we employ separate tempering of the prior and likelihood in the bridging densities, in which a weakly tempered prior is used. We use a truncated normal reference measure on the domain (−4, 4]^d with the Fourier basis to build DIRT. In TT-cross, a maximum of MaxIt = 1 iteration without enrichment (Rho = 0) is used. The number of collocation points in each dimension is set to n = 16, and the TT ranks are chosen as R_0 = R_max = R_k, where R_k ∈ {15, 15, 15, 15, 15, 13, 9, 9, 8, 7} at the k-th layer of DIRT. The PDE in (64) is computationally expensive to solve. Here our goal is to explore the posterior density defined by a forward model G_{h_f} with refined grid size h_f = 1/80. A coarse forward model G_{h_c} with h_c = 1/20 and an intermediate forward model G_{h_m} with h_m = 1/40 are used in defining the bridging densities to speed up the DIRT construction. This multilevel construction shares similarities with the multi-fidelity preconditioning strategy of [51], except that DIRT is based on TT rather than optimisation, and our multilevel models are blended into the bridging densities. In the numerical experiments, we take the CPU time of one coarse model evaluation as one work unit. The CPU times for evaluating the intermediate model and the fine model are about 12.5 and 160 work units, respectively.
In the first experiment, we employ the coarse forward model G_{h_c} to compare the sampling performance of DIRT with that of DRAM. The results are reported in Figure 12 (a), where the number of independent samples is calculated as the length of the Markov chain divided by the estimated IACT. The estimated IACTs for DRAM and DIRT are about 132 and 3.04, respectively, and importance sampling with DIRT produces ESS = N/1.5. For DRAM, we exclude the burn-in samples from the number of work units, whereas the number of work units for DIRT includes its construction cost (993392 density evaluations). In this experiment, despite the high construction cost, DIRT can generate a Markov chain with almost independent samples, which is significantly more efficient than DRAM. Furthermore, the construction cost of DIRT becomes less significant if one needs to generate more posterior samples, as shown in Figure 12.

In the second experiment, we demonstrate the construction of DIRT using not only bridging densities with different temperatures, but also forward models with different grid resolutions. For initial temperatures β_k < 10^{−0.5}, we use the coarse forward model G_{h_c}. For β_k = 10^{−0.5} and β_k = 10^{−0.25}, we use the intermediate forward model G_{h_m}. For β_k = 1, we use the fine forward model G_{h_f}, so that the fine model defines the target posterior density. We need 915024, 58544, and 19824 evaluations of the coarse, intermediate, and fine models, respectively, to construct DIRT. Once the DIRT is constructed, Algorithm 1 generates a Markov chain with an IACT of 2.87 that samples the posterior defined by the fine model. Again, importance sampling is more efficient, with ESS = N/1.78. The number of independent samples versus the number of work units is reported in Figure 12 (b). In this experiment, it is computationally infeasible to apply DRAM directly (or any MCMC in general) to sample the posterior defined by the fine model.
In contrast, evaluations of DIRT and the corresponding posterior densities are embarrassingly parallel, which can further accelerate posterior inference on high-performance computers. The IRT-IS algorithm bypasses the construction of Markov chains, which makes it suitable for integration into multilevel Monte Carlo or multilevel quasi-Monte Carlo estimators to improve the convergence rate of computing posterior expectations. We leave this as a future research direction.

Conclusion
We have enabled functional tensor decompositions of complicated and concentrated continuous probability density functions that suffer from impractically large tensor ranks when approximated directly. Instead, we build an adaptive sequential change of coordinates that drives the target density towards a product function. This change of variables is realised by the composition of order-preserving SIRTs computed from functional TT decompositions of ratios of bridging densities. Each of the ratio functions recovers one scale of correlations of the target density, and hence it can be approximated with fixed TT ranks. Together with the triangular structure of the Rosenblatt transport, this makes the total complexity linear in the number of variables.
This deep composition of the inverse Rosenblatt transports shares similarities with deep neural networks with nonlinear activation functions. However, DIRT has several advantages.
- Each DIRT layer, defined by the bridging densities, can be associated with a scale of the noise or the observation function. Any prior knowledge of model hierarchies can improve the selection of bridging densities. In contrast, the influence of a particular fully-connected layer in a neural network is difficult to predict or understand.
- DIRT layers can be computed independently. As soon as a layer is approximated to the desired accuracy, it can be saved and never recomputed. This enables a simple interactive construction, where the tuning parameters can be set layer by layer. Neural networks require optimisation of all layers simultaneously.
- The construction of each DIRT layer is powered by efficient TT-cross algorithms, which in many cases converge much faster than the stochastic gradient descent used by neural networks. The dense linear algebra operations used by TT decompositions can take full advantage of modern CPU and GPU vectorisation, whereas the embarrassing parallelism with respect to target density evaluations scales well to modern high-performance computers.
Definition 1 For each variable x_k, we consider a set of interpolation basis functions that can be represented by a vector-valued function Φ_k(x_k) ∈ R^{1×n_k} and a set of collocation points X_k = {x_k^{(i)}}_{i=1}^{n_k} such that the matrix Φ_k(X_k) ∈ R^{n_k×n_k} is an identity matrix. A typical construction is the (piecewise) Lagrange basis functions defined by a point set X_k. We can also construct the interpolation basis from other basis functions of a separable Hilbert space, denoted by Ψ_k(x_k) ∈ R^{1×n_k}, and a point set X_k with a nonsingular Vandermonde matrix Ψ_k(X_k), by setting Φ_k(x_k) = Ψ_k(x_k) Ψ_k(X_k)^{−1}. Specifically, if Ψ_k(x_k) is a set of λ_k-orthogonal functions and X_k are the roots of the function ψ_k^{(n_k+1)}(x_k), we recover the pseudo-spectral methods and have Ψ_k(X_k)^{−1} = Ψ_k(X_k)^⊤ diag(ω_k), where the vector ω_k ∈ R^{n_k} contains the quadrature weights associated with X_k and diag(·) maps a vector to a diagonal matrix.
Furthermore, we define the mass matrix M_k = ∫ Φ_k(x_k)^⊤ Φ_k(x_k) dλ_k(x_k) ∈ R^{n_k×n_k}. We let L_k ∈ R^{n_k×n_k} be the Cholesky factor of the mass matrix, i.e., L_k L_k^⊤ = M_k. For an interpolation basis constructed from λ_k-orthogonal functions and the roots of ψ_k^{(n_k+1)}(x_k), we have M_k = diag(ω_k).
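A minimal numerical check of Definition 1 for the pseudo-spectral case: we take Ψ to be the orthonormal Legendre functions and X the Gauss-Legendre points, so that λ_k is the Lebesgue measure on [−1, 1] (this specific choice of basis and measure is for illustration only).

```python
import numpy as np
from numpy.polynomial import legendre

n = 5
# Gauss-Legendre points (roots of the degree-n Legendre polynomial)
# and the associated quadrature weights on [-1, 1].
X, w = legendre.leggauss(n)

# Orthonormal Legendre functions psi_j(x) = sqrt((2j+1)/2) P_j(x),
# j = 0, ..., n-1, evaluated at X: the Vandermonde matrix Psi(X).
V = np.stack([np.sqrt((2 * j + 1) / 2.0) * legendre.legval(X, np.eye(n)[j])
              for j in range(n)], axis=1)

# The interpolation basis Phi(x) = Psi(x) Psi(X)^{-1} satisfies Phi(X) = I.
Phi_at_X = V @ np.linalg.inv(V)
print(np.allclose(Phi_at_X, np.eye(n)))   # True

# Its mass matrix M = Psi(X)^{-T} Psi(X)^{-1} equals diag(w), because
# n-point Gauss quadrature is exact here: Psi(X)^T diag(w) Psi(X) = I.
M = np.linalg.inv(V).T @ np.linalg.inv(V)
print(np.allclose(M, np.diag(w)))          # True
```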

Two dimensional case
Consider first the TT decomposition of a bivariate function, h(x_1, x_2) ≈ H_1(x_1) H_2(x_2), where the rank-r cores H_1(x_1) = Φ_1(x_1) A_1 and H_2(x_2) = A_2 Φ_2(x_2)^⊤ are specified by the basis functions Φ_1(x_1) ∈ R^{1×n_1} and Φ_2(x_2) ∈ R^{1×n_2} and the corresponding coefficient matrices A_1 ∈ R^{n_1×r} and A_2 ∈ R^{r×n_2}, respectively. We aim to recover A_1 and A_2 such that the L² norm of the error, ‖h − H_1 H_2‖, is minimised. Note that with interpolation bases, the matrices A_1 and A_2 are also pointwise evaluations of the functions H_1(x_1) and H_2(x_2) at the collocation points X_1 and X_2, respectively. This way, h(x_1, x_2) yields a discrete approximation h(X_1, X_2) ≈ A_1 A_2, where h(X_1, X_2) = [h(x_1^{(i)}, x_2^{(j)})] ∈ R^{n_1×n_2} for x_1^{(i)} ∈ X_1 and x_2^{(j)} ∈ X_2 is the matrix of nodal values of h(x_1, x_2), similarly to (68). Consequently, the L² error of the continuous factorisation yields the discrete approximation error ‖h(X_1, X_2) − A_1 A_2‖. Thus, we can recover the matrices A_1 and A_2 by solving some low-rank matrix decomposition of h(X_1, X_2). However, assembling the matrix h(X_1, X_2) requires evaluating the function h(x_1, x_2) at the Cartesian union of the collocation points X_1 × X_2, which becomes computationally prohibitive in the generalisation to d > 2. Instead, we can use interpolation point sets X̂_1 ⊂ X_1 and X̂_2 ⊂ X_2 of cardinality #X̂_1 = #X̂_2 = r, such that the matrix h(X̂_1, X̂_2) ∈ R^{r×r} is nonsingular, and rank-r interpolation cores G_1(x_1) = Φ_1(x_1) B_1 and G_2(x_2) = B_2 Φ_2(x_2)^⊤ with B_1 ∈ R^{n_1×r} and B_2 ∈ R^{r×n_2}, to approximate h(x_1, x_2) by interpolation. The interpolation cores satisfy the property that G_1(X̂_1) and G_2(X̂_2) are identity matrices. This yields interpolated approximations to h(x_1, x_2), for example, h(x_1, x_2) ≈ G_1(x_1) h(X̂_1, X̂_2) G_2(x_2). This way, the goal becomes identifying the optimal point sets (X̂_1, X̂_2) and the cores (G_1, G_2) that minimise the interpolated rank-r factorisation error. In practice, an alternating direction strategy can be employed to solve the above nonlinear minimisation problem via a sequence of subproblems, at a lower computational cost compared to that of the full matrix decomposition induced by (71).
For example, we start from some initial guess of B_2 and X̂_2 to solve for B_1 and X̂_1 via the minimisation problem (73); then we use the updated B_1 and X̂_1 to renew B_2 and X̂_2 via (74), and repeat until convergence. Given the interpolation point sets X̂_1 and X̂_2, the coefficient matrices B_1 and B_2 satisfy a simple quadratic optimisation, and can be computed from B_1 = h(X_1, X̂_2) h(X̂_1, X̂_2)^{−1} and B_2 = h(X̂_1, X̂_2)^{−1} h(X̂_1, X_2), respectively. Solving (74) only requires (n_1 + n_2 − r) r evaluations of h(x_1, x_2). In (74), one needs to find the interpolation point sets X̂_1 and X̂_2 so that the resulting interpolation operator is an optimal approximation to the projection operator spanning the same linear subspace. However, finding the optimal interpolation point sets is an NP-hard problem. In practice, accurate quasi-optimal solutions can be obtained by greedy algorithms such as the (discrete) empirical interpolation [1,8] or the maximum volume (MaxVol) [22,23,24] methods. Here we outline the procedure of the MaxVol algorithm [22] for solving (73), which can be equivalently expressed as the problem of searching for an index set I ⊂ {1, ..., n} of cardinality r such that the norm of B = H Ĥ^{−1} ∈ R^{n×r} is minimised, where Ĥ = H[I, :] ∈ R^{r×r} is the submatrix of a given matrix H in MATLAB notation. For example, one can take H = h(X_1, X̂_2); then the interpolation point set X̂_1 is given by I and the coefficient matrix is set to B_1 = B. Given an initial index set, which can be chosen as the r dominant pivots from Gaussian elimination, MaxVol proceeds as in Algorithm 3. Note that B[I, :] ∈ R^{r×r} is an identity matrix by construction. MaxVol ensures that no other row is more "important" by searching for a dominant submatrix Ĥ such that |B[i, j]| ≤ 1 + δ, which is a proxy for the maximum-volume submatrix Ĥ such that |det(Ĥ)| = max_I |det(H[I, :])|. The update of B can be computed efficiently via the Sherman-Morrison-Woodbury formula [22] at a total cost of O(n r²) per iteration.
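A sketch of the MaxVol procedure just outlined (our reading of Algorithm 3, which is not reproduced in this excerpt): the index set is initialised with pivots from Gaussian elimination, and rows are swapped until |B[i, j]| ≤ 1 + δ. For clarity, B is recomputed by a full inverse at each step rather than via the Sherman-Morrison-Woodbury update.

```python
import numpy as np

def pivot_init(H):
    # Initial index set: r dominant row pivots from Gaussian elimination.
    A = np.array(H, dtype=float)
    I = []
    for j in range(A.shape[1]):
        i = int(np.argmax(np.abs(A[:, j])))
        I.append(i)
        A -= np.outer(A[:, j], A[i, :]) / A[i, j]  # zeroes column j and row i
    return I

def maxvol(H, I, delta=1e-2, max_iter=100):
    # Greedy MaxVol row selection for a tall n x r matrix H: swap rows of the
    # selected submatrix until all entries of B = H H[I,:]^{-1} are <= 1+delta
    # in magnitude (each swap increases |det H[I,:]|).
    for _ in range(max_iter):
        B = H @ np.linalg.inv(H[I, :])
        i, j = np.unravel_index(np.argmax(np.abs(B)), B.shape)
        if abs(B[i, j]) <= 1.0 + delta:
            break
        I[j] = i  # row i replaces the j-th selected row
    return I, B

rng = np.random.default_rng(3)
H = rng.standard_normal((50, 5))
I, B = maxvol(H, pivot_init(H))
print(np.max(np.abs(B)))  # bounded by 1 + delta at convergence
```

By construction B[I, :] is an identity matrix, matching the interpolation property of the cores.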
For numerical stability it is beneficial to compute the thin generalised QR factorisation H R = h(X_1, X̂_2), where the matrix H has M_1-orthonormal columns. This way, the set of functions Φ_1(x_1) H forms a λ_1-orthogonal basis. The factorisation H R = h(X_1, X̂_2) can be obtained from the thin QR factorisation Q R = L_1^⊤ h(X_1, X̂_2) and H = L_1^{−⊤} Q. Then, one can apply MaxVol to H, which is the evaluation of Φ_1(x_1) H at X_1, to select the index set I, and thus the interpolation points X̂_1 ⊂ X_1. We have H[I, :] = h(X̂_1, X̂_2) R^{−1}, which yields B = H H[I, :]^{−1} = h(X_1, X̂_2) h(X̂_1, X̂_2)^{−1}. Thus, we can set the coefficient matrix B_1 = B and define the interpolation core G_1(x_1) = Φ_1(x_1) B_1 such that G_1(X̂_1) is an identity matrix.
We can obtain (B_1, X̂_1) and (B_2, X̂_2) by applying MaxVol within the alternating iterations. Then, we can set A_1 = B_1 and A_2 = h(X̂_1, X̂_2) B_2 to recover the factorisation in the form of (69).
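In discrete form, the resulting factorisation is the skeleton (cross) approximation h(X_1, X_2) ≈ h(X_1, X̂_2) h(X̂_1, X̂_2)^{−1} h(X̂_1, X_2). A minimal sketch: column-pivoted QR stands in for MaxVol as a cheap quasi-optimal row/column selector (an illustrative substitution, not the paper's algorithm).

```python
import numpy as np
from scipy.linalg import qr

# A smooth bivariate function evaluated on the full grid X1 x X2.
x1 = np.linspace(-1.0, 1.0, 40)
x2 = np.linspace(-1.0, 1.0, 60)
H = np.exp(np.outer(x1, x2))           # h(x1, x2) = exp(x1 * x2)

r = 6
# Greedy column selection via column-pivoted QR (a MaxVol surrogate) ...
_, _, pj = qr(H, pivoting=True)
J = pj[:r]
# ... then row selection on the chosen columns.
_, _, pi = qr(H[:, J].T, pivoting=True)
I = pi[:r]

# Skeleton approximation h(X1, Xhat2) h(Xhat1, Xhat2)^{-1} h(Xhat1, X2).
H_cross = H[:, J] @ np.linalg.solve(H[np.ix_(I, J)], H[I, :])
err = np.linalg.norm(H - H_cross) / np.linalg.norm(H)
print(err)  # small relative error for this smooth, low-rank function
```

Note that the approximation reproduces the selected cross exactly: rows I and columns J of H_cross coincide with those of H.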

Multi-dimensional case
The TT-cross algorithm [46] recursively extends (74) to d > 2. In the first step, we assume that a reduced point set X_{>1} = {(x_2^{(α_1)}, ..., x_d^{(α_1)})}_{α_1=1}^{r_1} of r_1 points is given; we can, for example, draw it from some tractable reference measure. We compute an analogue of the first equation in (74), where h(X_1, X_{>1}) is a matrix filled with the function h(x) evaluated at the "reduced" set of points X_1 × X_{>1}. Now we apply MaxVol to compute reduced subsets I_1 ⊂ {1, ..., n_1} and X_{<2} = X_1(I_1) ⊂ X_1. Similarly to the matrix B_1 in the two-dimensional case, we let the actual TT core be the "stabilised" matrix A_1 = H_1 H_1[I_1, :]^{−1}, where H_1 R_1 = h(X_1, X_{>1}) is the generalised QR decomposition.
In the k-th step, we assume reduced point sets X_{<k} = {(x_1^{(α_{k−1})}, ..., x_{k−1}^{(α_{k−1})})}_{α_{k−1}=1}^{r_{k−1}} and X_{>k} = {(x_{k+1}^{(α_k)}, ..., x_d^{(α_k)})}_{α_k=1}^{r_k} are given. We can compute a third-order tensor H_k which consists of evaluations of h(x) at the Cartesian union of the sets X_{<k} × X_k × X_{>k}. We let X_{<1} = X_{>d} = ∅ to enable this notation for all k. We can unfold H_k into matrices of the form H_k^{(L)} ∈ R^{(r_{k−1} n_k)×r_k} and H_k^{(R)} ∈ R^{r_{k−1}×(n_k r_k)}. The union of the indices α_{k−1} and i_k corresponds to the union of the point sets X_{≤k} := X_{<k} × X_k.
Therefore, we can apply MaxVol to H_k^{(L)} (or a generalised QR factor thereof) to obtain a discrete index set I_k ⊂ {1, ..., r_{k−1} n_k}, and take a subsample of X_{≤k} for the next recursion step, X_{<k+1} = X_{≤k}(I_k). Similarly, we define the k-th TT core as A_k = H_k^{(L)} H_k^{(L)}[I_k, :]^{−1}. If the function h(x) admits an exact TT decomposition, and the initial point sets are chosen such that all H_k^{(L)} are full-rank, the recursion defined above reconstructs the decomposition exactly. However, in practice the initial point sets can be poor interpolation sets. In this case we can refine them by carrying out several iterations. Having computed A_d, we reverse the recursion and iterate backwards, computing discrete sets J_k ⊂ {1, ..., n_k r_k} via MaxVol applied to (H_k^{(R)})^⊤, and setting X_{>k−1} = X_{≥k}(J_k), where X_{≥k} = X_k × X_{>k}.
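One forward sweep of this recursion can be sketched in a few dozen lines for a three-dimensional nodal tensor. This is a simplified illustration, not the paper's Algorithm 4: column-pivoted QR substitutes for MaxVol, the initial right point sets are drawn at random, and a single sweep is performed (a backward sweep and further iterations would refine the point sets).

```python
import numpy as np
from scipy.linalg import qr

def h(pts):
    # Test function evaluated at an (N, 3) array of points.
    return 1.0 / (1.0 + pts.sum(axis=1))

n, r, d = 12, 4, 3
grid = [np.linspace(0.0, 1.0, n) for _ in range(d)]
rng = np.random.default_rng(4)

# Initial "right" point sets X_{>k}: r random tuples over dimensions k+2..d.
X_right = [None] * d
X_right[d - 1] = np.zeros((1, 0))
for k in range(d - 2, -1, -1):
    X_right[k] = np.stack(
        [rng.choice(grid[j], size=r, replace=False) for j in range(k + 1, d)],
        axis=1)

cores, X_left = [], np.zeros((1, 0))        # X_{<1} is empty, r_0 = 1
for k in range(d):
    rl, rr = X_left.shape[0], X_right[k].shape[0]
    il, im, ir = np.meshgrid(np.arange(rl), np.arange(n), np.arange(rr),
                             indexing="ij")
    # Evaluate h on the Cartesian union X_{<k} x X_k x X_{>k}; left unfolding.
    pts = np.hstack([X_left[il.ravel()], grid[k][im.ravel(), None],
                     X_right[k][ir.ravel()]])
    Hk = h(pts).reshape(rl * n, rr)
    if k < d - 1:
        Q, _ = np.linalg.qr(Hk)              # stabilised left unfolding
        _, _, piv = qr(Q.T, pivoting=True)   # pivoted QR as MaxVol surrogate
        I = piv[:Q.shape[1]]
        cores.append((Q @ np.linalg.inv(Q[I, :])).reshape(rl, n, -1))
        X_left = np.hstack([X_left[I // n], grid[k][I % n, None]])
    else:
        cores.append(Hk.reshape(rl, n, 1))   # last core keeps raw values

# Compare the cross interpolant with the full tensor of nodal values.
T_tt = np.einsum("aib,bjc,ckd->ijk", *cores)
pts_full = np.stack([g.ravel() for g in np.meshgrid(*grid, indexing="ij")],
                    axis=1)
T = h(pts_full).reshape(n, n, n)
err = np.linalg.norm(T - T_tt) / np.linalg.norm(T)
print(err)
```

The sweep touches only r_{k−1} n_k r_k entries per step, rather than the full n^d tensor, which is the essential saving of TT-cross.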
The second key ingredient is the adaptation of TT ranks. The TT ranks can easily be reduced: for example, it is sufficient to replace the generalised QR factorisation of H_k^{(L)} or (H_k^{(R)})^⊤ by a generalised SVD in which the singular values below the desired threshold are truncated. To increase the TT ranks, we can apply oversampling. Using the forward iteration (with k increasing) as an example, we can compute the tensor H_k ∈ R^{r_{k−1}×n_k×(r_k+ρ_k)} on the enriched point set X_{<k} × X_k × (X_{>k} ∪ X̃_{>k}), where X̃_{>k} = {(x_{k+1}^{(α_k)}, ..., x_d^{(α_k)})}_{α_k=1}^{ρ_k} are auxiliary points. These auxiliary points can be sampled at random [45] or, more accurately, from a surrogate of the error [18]. In the latter case, we carry out a second TT-cross to approximate the error h(x) − h̃(x) by a TT decomposition with TT ranks ρ_1, ..., ρ_{d−1}, and take the MaxVol points of the error as X̃_{>k}. This enrichment of the solution with error or residual information has proven to accelerate the convergence drastically, even for small expansion ranks ρ_k, when applied to solving linear systems [18]. The pseudocode of the TT-cross is provided in Algorithm 4.
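The rank-reduction step can be illustrated on a synthetic unfolding: the SVD of H_k^{(L)} reveals the numerical TT rank, and truncating the small singular values shrinks the core. A minimal sketch with made-up cores:

```python
import numpy as np

rng = np.random.default_rng(5)
# A 6 x 8 x 7 tensor with exact TT ranks (3, 2), mimicking an unfolding
# whose rank has been over-estimated and should be truncated.
G1 = rng.standard_normal((6, 3))
G2 = rng.standard_normal((3, 8, 2))
G3 = rng.standard_normal((2, 7))
T = np.einsum("ab,bic,cd->aid", G1, G2, G3)

# Left unfolding over the first two modes; its truncated SVD reveals r_2 = 2.
HL = T.reshape(6 * 8, 7)
U, s, Vt = np.linalg.svd(HL, full_matrices=False)
r = int(np.sum(s > 1e-12 * s[0]))       # keep singular values above threshold
HL_r = (U[:, :r] * s[:r]) @ Vt[:r]      # rank-truncated unfolding
print(r, np.linalg.norm(HL - HL_r))
```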

Algorithm 4 TT-cross
The tensor in (75) suggests that the TT-cross requires ∑_{k=1}^{d} r_{k−1} n_k r_k evaluations of h(x) per iteration, which is proportional to the number of unknowns in the TT cores. To enhance the robustness (at the expense of a larger number of function evaluations), one may oversample X_k beyond n_k basis functions, and use the rectangular MaxVol algorithm [42] to oversample I_k, J_{k+1} beyond r_k indices. In this case, the matrix inverse in (77) is replaced by a pseudoinverse. For our DIRT framework, the standard MaxVol equipped with the error enrichment is sufficiently robust to factorise the ratio functions, so we proceed with Algorithm 4.
where M_k ∈ R^{n_k×n_k} is the symmetric positive definite mass matrix defined in (22). Given the Cholesky decomposition L_k L_k^⊤ = M_k, we have the corresponding identity. Substituting the above identity into (83), denoting the resulting tensor by C_k, and unfolding C_k along the first coordinate, similarly to (76), to obtain a matrix C_k^{(R)} ∈ R^{r_{k−1}×(n_k r_k)}, we have, equivalently, M_k = C_k^{(R)}