Approximation and sampling of multivariate probability distributions in the tensor train decomposition

General multivariate distributions are notoriously expensive to sample from, particularly the high-dimensional posterior distributions in PDE-constrained inverse problems. This paper develops a sampler for arbitrary continuous multivariate distributions that is based on low-rank surrogates in the tensor-train format. We construct a tensor-train approximation to the target probability density function using cross interpolation, which requires a small number of function evaluations. For sufficiently smooth distributions the storage required for the TT approximation is moderate, scaling linearly with dimension. The structure of the tensor-train surrogate allows efficient sampling by the conditional distribution method. Unbiased estimates may be calculated by correcting the transformed random seeds using a Metropolis--Hastings accept/reject step. Moreover, one can use a more efficient quasi-Monte Carlo quadrature that may be corrected either by a control-variate strategy, or by importance weighting. We show that the error in the tensor-train approximation propagates linearly into the Metropolis--Hastings rejection rate and the integrated autocorrelation time of the resulting Markov chain. These methods are demonstrated in three computed examples: fitting the failure time of shock absorbers; a PDE-constrained inverse diffusion problem; and sampling from the Rosenbrock distribution. The delayed rejection adaptive Metropolis (DRAM) algorithm is used as a benchmark. We find that the importance-weight corrected quasi-Monte Carlo quadrature performs best in all computed examples, and is orders-of-magnitude more efficient than DRAM across a wide range of approximation accuracies and sample sizes. Indeed, all the methods developed here significantly outperform DRAM in all computed examples.


Introduction
quadrature becomes the only possibility, with the quadrature error depending on the particular distribution of the samples. When the target density function admits a TT approximation with modest storage, the cumulative transform method can produce optimally distributed samples at a low cost. Secondly, even when fast growth of the TT storage prevents accurate computation of the density function, the TT-surrogate distributed samples can still be used as proposals in the MH algorithm, or with importance weighting. Even a crude approximation to the PDF with 10% error can produce an acceptance rate of 90% and an integrated autocorrelation time of 1.2, which is close to the best possible for a practical MCMC. The relationship between approximation error and acceptance rate is formalized in Section 4.2.
The paper is structured as follows: In Section 2 we review the conditional sampling method used to sample from the multivariate TT-interpolated approximation. Some background on the TT decomposition is presented in Section 3. A Metropolised algorithm that uses the TT surrogate for sampling from the target distribution is presented in Section 4, as well as methods for unbiased quadrature that utilize a two-level algorithm, importance weighting, and quasi-Monte Carlo seed points. Several numerical examples are presented in Section 5: Section 5.1 shows posterior estimation of a shock absorber failure probability; Section 5.2 demonstrates efficient sampling when the Rosenbrock function is the log target density, a synthetic 'banana-shaped' PDF that presents difficulties to random-walk MCMC samplers; and Section 5.3 demonstrates posterior inference in a classical inverse problem in subsurface flow. In each of the numerical examples, scaling for the TT-based sampling and quadrature is shown, with comparison to DRAM [21], as well as (in Section 5.3) to direct quasi-Monte Carlo quadrature.

Conditional distribution sampling method
The conditional distribution method [4,24,23] reduces the task of generating a d-dimensional random vector to a sequence of d univariate generation tasks.
We use the inverse cumulative transformation method for each univariate sample, and thus implement the (inverse of the) Rosenblatt transformation [42] from the d-dimensional unit cube to the state-space of π. The standard conditional distribution method seeds the transformation with independent samples distributed uniformly in the unit cube, to produce independent draws from π. This generalizes the inverse cumulative transformation method for univariate distributions. Later, we will also seed this transformation with quasi-random points to implement quasi-Monte Carlo quadrature for evaluating expectations with respect to π.
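The sequential structure described above can be sketched in a few lines. The following is a minimal illustration (not the paper's implementation) of the conditional distribution method in d = 2, using the brute-force marginalization that the TT decomposition later replaces; the Gaussian-product test density and the uniform tensor grid are our own toy choices.

```python
import numpy as np

def inv_cdf(x, p, u):
    """Inverse CDF of an un-normalized density tabulated at nodes x,
    via piecewise-linear interpolation; normalization comes for free."""
    cdf = np.concatenate([[0.0], np.cumsum(0.5 * (p[1:] + p[:-1]) * np.diff(x))])
    cdf /= cdf[-1]
    return np.interp(u, cdf, x)

def sample_2d(pdf_vals, x1, x2, seeds):
    """Map uniform seeds in [0,1]^2 to draws from a density tabulated on
    the tensor grid (x1, x2): marginal in x1, then conditional in x2."""
    p1 = pdf_vals.sum(axis=1)            # un-normalized marginal in x1
    out = np.empty((len(seeds), 2))
    for s, (u1, u2) in enumerate(seeds):
        xa = inv_cdf(x1, p1, u1)
        i = np.argmin(np.abs(x1 - xa))   # nearest grid slice for the conditional
        out[s] = xa, inv_cdf(x2, pdf_vals[i, :], u2)
    return out
```

Seeding `sample_2d` with i.i.d. uniform points gives independent draws; seeding it with quasi-random points gives the QMC variant discussed later.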
When the analytic inverse of each univariate cumulative distribution function is not available, a straightforward numerical procedure is to discretize the univariate density on a grid and carry out approximate sampling using polynomial interpolation. In that case, the normalization, i.e., the denominator in (1), is not needed explicitly, since the normalization of the numerical approximation can be evaluated directly; this allows sampling from an un-normalized marginal density (2).
The main difficulty with the conditional distribution method for multi-variate random generation is obtaining all necessary marginal densities, which requires the high-dimensional integral over x k+1 . . . x d in (2). In general, this calculation can be extremely costly. Even a simple discretization of the argument of the marginal densities (2), or the conditional-marginal densities (1), leads to exponential cost with dimension.
To overcome this cost, we precompute an approximation of π(x 1 , . . . , x d ) in a compressed representation that allows fast computation of integrals in (2), and subsequent sampling from the conditionals in (1). In the next sections, we introduce the TT decomposition and the related TTcross algorithm [40] for building a TT approximation to π. Moreover, we show that the separated form of the TT representation allows an efficient integration in (2), with cost that scales linearly with dimension.
The TT decomposition natively represents a tensor, or d-dimensional array of values. The function approximation (3) is obtained by first approximating the tensor that results from discretizing the PDF π(x_1, …, x_d) by collocation on a tensor product of univariate grids. Let x_k^{i_k} ∈ R, with i_k = 1, …, n_k and x_k^1 < ⋯ < x_k^{n_k}, define independent univariate grids in each variable, and let the tensor of nodal values be decomposed as π̂(i_1, …, i_d) = π̂^{(1)}(i_1) π̂^{(2)}(i_2) ⋯ π̂^{(d)}(i_d), (4) with TT blocks π̂^{(k)}. Each TT block is a collection of r_{k−1} r_k vectors of length n_k, i.e., π̂^{(k)}(i_k) = π̂^{(k)}(x_k^{i_k}) is a three-dimensional tensor of size r_{k−1} × n_k × r_k. If we assume that all n_k ≤ n and r_k ≤ r for some uniform bounds n, r ∈ N, the storage cost of (4) can be estimated by dnr^2, which is linear in the number of variables. In contrast, the number of elements in the tensor of nodal values π̂(i_1, …, i_d) grows exponentially in d and quickly becomes prohibitively large with increasing d.
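The storage count and the entry-wise evaluation of a TT decomposition can be sketched directly. This is a toy illustration of the data structure with our own naming conventions, not the paper's implementation: each core is an array of shape (r_{k−1}, n_k, r_k), and one tensor entry is a chain of small matrix products.

```python
import numpy as np

def tt_random(d, n, r, rng):
    """Random TT cores with uniform mode size n and uniform rank r."""
    ranks = [1] + [r] * (d - 1) + [1]
    return [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]

def tt_entry(cores, idx):
    """One entry of the represented tensor: a product of r x r matrices."""
    v = np.ones((1, 1))
    for core, i in zip(cores, idx):
        v = v @ core[:, i, :]          # (1, r_{k-1}) @ (r_{k-1}, r_k)
    return v[0, 0]

def tt_storage(cores):
    """Total number of stored values: sum_k r_{k-1} n_k r_k <= d n r^2."""
    return sum(c.size for c in cores)
```

For d = 10, n = 8, r = 3 the TT storage is 624 numbers, while the full tensor would hold 8^10 ≈ 10^9 entries.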
The continuous approximation (3) of π is given by a piecewise polynomial interpolation of the nodal values in the TT blocks; for example, piecewise linear interpolation of each block in x_k induces the corresponding multilinear approximation π̂ of π in (3). If the individual terms π^{(k)}_{α_{k−1},α_k}(x_k) are normalized PDFs, the TT approximation in (3) may be viewed as a mixture distribution. However, the TT decomposition can be more general and may also include negative terms. Moreover, at some locations where π(x) is close to zero, the whole approximation π̂(x) may take (small) negative values. This will be circumvented by explicitly taking absolute values in the conditional distribution sampling method; see Sec. 4.1.
The interpolated TT approximation to π in (3) required several choices. First a coordinate system must be chosen, then an ordering of coordinates, then a rectangular region that contains the (appreciable) support of the PDF, and then univariate grids for each coordinate within the rectangular region. Each of these choices affects the TT ranks, and hence the efficiency of the TT representation in terms of storage size versus accuracy of the approximation, which is itself also a choice; see below. In this sense, the sampler that we develop is not 'black box'. However, as we demonstrate in the computed examples, an unsophisticated choice for each of these steps already leads to a computational method for sampling and evaluating expectations that is substantially more efficient than existing MCMC algorithms. Smart choices for each of these steps could lead to further improvements.
The rationale behind the independent discretization of all variables is the rapid convergence of tensor product Gaussian quadrature rules. If π(x) is analytic with respect to all variables, the error of the Gaussian quadrature converges exponentially in n. A straightforward summation of n d quadrature terms would imply a cost of O(| log ε| d ) for accuracy ε. In contrast, the TT ranks often depend logarithmically on ε under the same assumptions on π(x) [51,25,46], leading to O(d| log ε| 3 ) cost of the TT integration, since the integration of the TT decomposition factorizes into one-dimensional integrals over the TT blocks. This can also be significantly cheaper than the O(ε −2 ) cost of Monte Carlo quadrature.
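The factorization of the TT integral into univariate quadratures can be sketched as follows; this is a minimal illustration under our own conventions (cores as arrays of shape (r_{k−1}, n_k, r_k), trapezoidal weights), not the paper's code.

```python
import numpy as np

def tt_integrate(cores, weights):
    """Integral of a TT-represented function: contract each TT block with
    its univariate quadrature weights, then chain the small matrices."""
    v = np.ones((1, 1))
    for core, w in zip(cores, weights):
        v = v @ np.einsum('anb,n->ab', core, w)   # (r_{k-1}, r_k) matrix
    return v[0, 0]
```

For a rank-1 (separable) density the result reduces to the product of the d univariate integrals, at O(dnr^2) cost instead of n^d terms.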
In general, it is difficult to deduce sharp bounds for the TT ranks. Empirically, low ranks occur in the situation of "weakly" dependent variables. For example, if x 1 , . . . , x d correspond to independent random quantities, the PDF factorizes into a single product of univariate densities, which corresponds to the simplest case, r = 1 in (3). Thus, a numerical algorithm that can robustly reveal the ranks is indispensable.

TT-cross approximation
A quasi-optimal approximation of π̂ for a given TT rank, in the Frobenius norm, is available via the truncated singular value decomposition (SVD) [37]. However, the SVD requires storage of the full tensor, which is not affordable in many dimensions. A practical method needs to be able to compute the representation (3) using only a few evaluations of π. A workhorse algorithm of this kind is the alternating TT-cross method [40]. It builds on the skeleton decomposition of a matrix [18], which represents an n × m matrix A of rank r as the cross (in MATLAB-like notation) A ≈ A(:, J) A(I, J)^{−1} A(I, :), (5) of r columns and rows, where I and J are two index sets of cardinality r such that A(I, J) (the intersection matrix) is nonsingular. If r ≪ n, m, this decomposition requires computing only (n + m − r)r ≪ nm elements of the original matrix. The SVD may be used for choosing the cross (5), though with greater cost, as noted above.
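The skeleton decomposition (5) can be sketched in one line; assuming a rank-r matrix and index sets I, J whose intersection matrix is nonsingular, the reconstruction from r rows and r columns is exact.

```python
import numpy as np

def skeleton(A, I, J):
    """Cross approximation (5): A ~= A[:, J] @ inv(A[I, J]) @ A[I, :].
    Exact when rank(A) = len(I) = len(J) and A[I, J] is nonsingular."""
    return A[:, J] @ np.linalg.solve(A[np.ix_(I, J)], A[I, :])
```

Only the crossed rows and columns of A are ever evaluated, which is the property the TT-cross method exploits in each unfolding.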
The TT-cross approximation may be constructed, at least in concept, by a recursive reduction of the sequence of unfolding matrices π̂_k = [π̂(i_1, …, i_k; i_{k+1}, …, i_d)], which have the first k indices grouped together to index rows, and the remaining indices grouped to index columns. We begin with π̂_1.
Assume that there exists a set of r_1 (d−1)-tuples, I_{>1} = {(i_2^{α_1}, …, i_d^{α_1})}_{α_1=1}^{r_1}, such that the vectors π̂(:, I_{>1}) form a "good" basis for the rows of π̂_1, i.e., in the i_1 variable. The reduction (5) may be formed for r_1 rows indexed by I_{<2} = {i_1^{α_1}}_{α_1=1}^{r_1}, with the row index set I_{<2} optimized by choosing the r_1 × r_1 submatrix π̂(I_{<2}, I_{>1}) of maximum volume (modulus of determinant), by the maxvol algorithm [17] in O(nr_1^2) operations. The discrete TT block π̂^{(1)} is the rectangular n × r_1 matrix π̂(:, I_{>1}) π̂(I_{<2}, I_{>1})^{−1}, and the reduced tensor π̂(I_{<2}, i_2, …, i_d) is passed on to the induction step of the TT cross algorithm. In a practical algorithm, these operations are actually carried out on QR decompositions of the matrices [40], for numerical stability. In the k-th step, assume that there exist the reduction π̂_{>k−1}(α_{k−1}, i_k, …, i_d) and a set of (k−1)-tuples I_{<k}; the reduction can be seen as an r_{k−1}n × r_k rectangular matrix, and the maxvol algorithm can be applied, again, to produce a set of row positions I_k = {(α_{k−1}^{α_k}, i_k^{α_k})}_{α_k=1}^{r_k}. The next set I_{<k+1} is obtained from I_k by replacing each α_{k−1} with the corresponding indices i_1, …, i_{k−1} from I_{<k}. This process can also be organized in the form of a binary tree, which gives rise to the so-called hierarchical Tucker cross algorithm [1]. In total, we need O(dnr^2) evaluations of π and O(dnr^3) additional operations in the computation of the maximum volume matrices.
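The maxvol step can be sketched as follows; this is a minimal version of the classical swap iteration (initialization and tolerance are our own simplifications; production codes initialize from a pivoted LU factorization).

```python
import numpy as np

def maxvol(A, tol=1.01, itmax=500):
    """Greedy maxvol iteration on a tall n x r matrix A: return indices of
    r rows whose submatrix has locally maximum volume (modulus of det).
    Each swap multiplies the volume by |B[i, j]| > tol, so it terminates."""
    n, r = A.shape
    I = np.arange(r)                    # any nonsingular starting rows
    for _ in range(itmax):
        B = A @ np.linalg.inv(A[I, :])  # B[I, :] is the r x r identity
        i, j = np.unravel_index(np.argmax(np.abs(B)), B.shape)
        if abs(B[i, j]) <= tol:
            break                       # no row can enlarge the volume
        I[j] = i                        # swap row j of the cross for row i
    return I
```

At convergence all entries of A A(I, :)^{−1} are bounded by tol in modulus, the standard quasi-optimality certificate for the cross.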
The choice of the univariate grids, x 1 k < · · · < x n k k , and initial index sets I >k can be crucial. In this paper we found that a uniform grid in each coordinate was sufficient, with even relatively coarse grids resulting in efficient sampling algorithms; see the numerical examples for details. Given any easy to sample reference distribution (e.g. uniform or Gaussian), it seems reasonable to initialize I >k with independent realizations of that distribution (we could also expand the grids with reference samples, though we did not do that). If the target function π admits an exact TT decomposition with TT ranks not greater than r 1 , . . . , r d−1 , and all unfolding matrices have ranks not smaller than the TT ranks of π, the cross iteration outlined above reconstructsπ exactly [40]. This is still a rare exception though, since most functions have infinite exact TT ranks, even if they can be approximated by a TT decomposition with a small error and low ranks. Nevertheless, the cross iteration, initialized with slightly overestimated values r 1 , . . . , r d−1 , can deliver a good approximation, if a function is regular enough [1,8].
This might not be the case for localized probability density functions. For example, for a heavy-tailed function (1 + x_1^2 + ⋯ + x_d^2)^{−1/2} one might try to produce I_{>k} from a uniform distribution in a cube [0, a]^d with a sufficiently large a. However, since this function is localized in an exponentially small volume [0, ε]^d, uniform index sets deliver a poor TT decomposition, worse for larger a and d.
In this situation it is crucial to use fine grids and refine the sets I_{<k}, I_{>k} by conducting several TT cross iterations, going back and forth over the TT blocks and optimizing the sets by the maxvol algorithm. For example, after computing π̂^{(d)} = π̂_{>d−1}, we "reverse" the algorithm and consider the unfolding matrices with the indices i_k, …, i_d grouped to index columns. Applying the maxvol algorithm to the columns of the r_{d−1} × n matrix π̂^{(d)}, we obtain a refined set of points I_{>d−1} = {i_d^{α_{d−1}}}_{α_{d−1}=1}^{r_{d−1}}. The recursion continues from k = d to k = 1, optimizing the right sets I_{>k}, while taking the left sets I_{<k} from the previous (forward) iteration. After several iterations, both I_{<k} and I_{>k} can be optimized to the particular target function, even if the initial index sets gave a poor approximation.
This adaptation of points goes hand in hand with the adaptation of ranks. If the initial cardinalities r_1, …, r_{d−1} were too large for the desired accuracy, they can be reduced to the proper values. However, we can also increase the ranks by computing the unfolding matrix π(I_{<k}, i_k; i_{k+1}^{α_k}, …, i_d^{α_k}) on an enriched index set, where {i_{k+1}^{α_k}, …, i_d^{α_k}} are taken from I_{>k} for α_k = 1, …, r_k, and from an auxiliary set I_{>k}^{aux} for α_k = r_k + 1, …, r_k + ρ. This increases the k-th TT rank from r_k to r_k + ρ. The auxiliary set can be chosen at random [36] or using a surrogate for the error [52,7]. The pseudocode of the entire TT cross method is listed in Algorithm 1. For uniformity, we let I_{<1} = I_{>d} = ∅.

[Algorithm 1: TT cross algorithm for TT approximation of π. Input: initial index sets I_{>k}, rank increase parameter ρ ≥ 0, stopping tolerance δ > 0 and/or maximum number of iterations iter_max. Output: TT blocks of an approximation π̂(x) ≈ π(x).]
Empowered with the enrichment scheme, we can even move away from truncating ranks. Instead, we start with a low-rank initial guess and increase the ranks until the desired accuracy is met. We have found that this approach is more accurate in numerical experiments.

Conditional Distribution Sampling (TT-CD)
One of the main contributions of this paper is to show that the conditional distribution method is feasible, and efficient, once a PDF has been put into TT format. This section presents those calculations.
First we describe computation of the marginal PDFs (2), given π in a TT format (3). Notice that integrals over the variable x p appear in all conditionals (2) with k < p. The TT format allows computing the r k−1 × 1 vector P k required for evaluating the marginal PDF p k−1 by the following algorithm.
Since π^{(k)}(x_k) ∈ R^{r_{k−1} × r_k} for each fixed x_k, the integral ∫ π^{(k)}(x_k) dx_k is an r_{k−1} × r_k matrix, where α_{k−1} is the row index and α_k is the column index. Hence, we can write Line 3 as the matrix-vector product P_k = (∫ π^{(k)}(x_k) dx_k) P_{k+1}, starting from P_{d+1} = 1.

[Algorithm 2: TT-CD sampling — for each seed point, loop over k = 1, …, d, form the k-th marginal PDF and draw a sample for the x_k component by inverting its CDF.]

Assuming n quadrature points for each x_k, and the uniform rank bound r_k ≤ r, the asymptotic complexity of this algorithm is O(dnr^2).
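The right-to-left accumulation of the vectors P_k can be sketched as follows (a toy version with our own zero-based indexing; P[0] plays the role of the normalization constant):

```python
import numpy as np

def marginal_prefactors(cores, weights):
    """Accumulate P_{d+1} = 1 and P_k = (integral of core k) @ P_{k+1},
    sweeping right to left; cores have shape (r_{k-1}, n_k, r_k) and
    weights[k] are the univariate quadrature weights on the k-th grid."""
    d = len(cores)
    P = [None] * (d + 1)
    P[d] = np.ones(1)
    for k in reversed(range(d)):
        Gk = np.einsum('anb,n->ab', cores[k], weights[k])  # r_{k-1} x r_k
        P[k] = Gk @ P[k + 1]
    return P          # P[0] is the (scalar) total integral of the TT density
```

Each marginal then needs only the already-evaluated left blocks and the stored vector P_{k+1}, which is what makes the sweep O(dnr^2) overall.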
The first marginal PDF is approximated by p_1(x_1) ∝ |π^{(1)}(x_1) P_2|. We take the absolute value because the TT approximation π̂ (and hence π^{(1)}(x_1)P_2) may be negative at some locations. In the k-th step of the sampling procedure, the marginal PDF also requires the first k − 1 TT blocks, restricted to the components of the sample that are already determined. However, since the loop goes sequentially from k = 1 to k = d, the sampled TT blocks can be accumulated in the same fashion as the integrals P_k. The overall method for drawing N samples is written in Algorithm 2. Note that if π̂ is negative at any points, the actual density π* at x, which is the product of the marginal PDFs computed in each step, may slightly differ from π̂; see Sec. 4.2.
The sample-independent prefactor of the marginal PDF in Line 7 requires O(dnr^2) operations. After that, the marginal PDF in Line 9 can be computed with O(dNnr) cost. Computation of the CDF in Line 10 depends on the quadrature scheme used. Using a piecewise spline approximation or the barycentric Gaussian formula leads to a linear cost O(dNn) for both C_k and C_k^{−1}. The complexity of computing the conditional PDF values Φ_{k+1} depends on how π̂ is interpolated in x_k. Global Lagrange interpolation requires O(nr^2) cost per sample, but local interpolation is free of the n-term, needing only O(r^2) operations. In our numerical experiments we have found piecewise linear interpolation on a uniform grid to be sufficient, so we end up with the latter estimate. In summary, the total complexity is O(dr(nr + N(n + r))), which is linear in d.

TT-CD with Metropolis-Hastings correction (TT-MH)
For the TT-CD sampling procedure in Alg. 2 to be fast, the TT ranks r should be as small as possible. Since the joint PDF is typically a complicated multivariate function, its TT ranks may grow quickly with increasing accuracy. This motivates the use of a coarse TT approximation, with the TT-CD sampling used as an independence proposal in the Metropolis-Hastings (MH) algorithm, to 'correct' the distribution and ensure that samples are distributed as the target distribution π. When the current state is x and the new proposal is x′, the next state is determined by the stochastic iteration that first computes the Metropolis-Hastings ratio h(x, x′) = [π(x′) π*(x)] / [π(x) π*(x′)], and accepts the proposal with probability α(x, x′) = min{1, h(x, x′)}, putting the new state x = x′; otherwise x′ is rejected and the chain remains at x. We consider the acceptance rate and the integrated autocorrelation time as efficiency indicators of this MCMC algorithm. In this section, we study how they depend on the approximation error in the PDF. Throughout we must assume that π is absolutely continuous with respect to π*, which guarantees reversibility with respect to π [50], and that we can evaluate the importance ratio w(x) = π(x)/π*(x). We require that w* ≡ ess sup_x w(x) is finite, which is equivalent to uniform geometric convergence (and ergodicity) of the chain [41]. (The essential supremum may be taken with respect to π or π*.) Lemma 1. Suppose that the mean absolute error in the TT-CD sampling density satisfies ∫ |π(x) − π*(x)| dx ≤ ε, with π normalized. Then the rejection rate is bounded by ε, i.e., E[1 − α(x, x′)] ≤ ε, where the expectation is taken over the chain.
Proof. Using ergodicity of the chain, the expected rejection rate is E[1 − α(x, x′)] = ∫∫ π(x) π*(x′) (1 − α(x, x′)) dx dx′ = ∫∫ max{0, π(x)π*(x′) − π(x′)π*(x)} dx dx′ = (1/2) ∫∫ |π(x)π*(x′) − π(x′)π*(x)| dx dx′ ≤ (1/2) ∫∫ [π*(x′)|π(x) − π*(x)| + π*(x)|π(x′) − π*(x′)|] dx dx′, where the last step uses the triangle inequality. Integrating both terms on the right with respect to x and x′, we obtain the claim of the lemma.
This lemma indicates that the rejection rate decreases proportionally to ε, which is the total error from the interpolation of discrete values of π on a grid, approximation of π by a low-rank TT decompositionπ, and taking the absolute values in Alg. 2, Line 9.
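The proportionality between approximation error and rejection rate can be illustrated numerically with a minimal independence MH sampler. The 1D target and the slightly perturbed proposal density below are our own toy choices, not the paper's example; normalization constants cancel in the MH ratio, so unnormalized log-densities suffice.

```python
import numpy as np

def independence_mh(logpi, logpistar, proposals, rng):
    """Independence MH chain driven by pre-drawn pi*-distributed proposals;
    returns the chain and the empirical rejection rate."""
    x, rejections, chain = proposals[0], 0, []
    for y in proposals[1:]:
        # log of h(x, y) = pi(y) pi*(x) / (pi(x) pi*(y))
        logh = logpi(y) - logpi(x) + logpistar(x) - logpistar(y)
        if np.log(rng.random()) < min(0.0, logh):
            x = y                       # accept the proposal
        else:
            rejections += 1             # reject: the chain repeats x
        chain.append(x)
    return np.array(chain), rejections / (len(proposals) - 1)
```

With a proposal close to the target, the rejection rate is small and the chain is nearly independent, which is the regime the TT surrogate aims for.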
The latter error is of the order of the TT approximation error: the approximate marginal p̂_k can only be negative at points where the exact marginal is smaller than the approximation error, and at such points |p̂_k| itself is of the order of that error, so the additional error p*_k − p̂_k committed by taking the modulus is of the same order. Lemma 1 assumed a mean absolute error. We need the stronger assumption of bounded local relative error, w* < ∞, to bound the integrated autocorrelation time (IACT) [53], defined as τ = 1 + 2 Σ_{t=1}^{∞} ρ_gg(t), where ρ_gg(t) is the autocorrelation coefficient for the chain in statistic g at lag t. Defined like this, τ ≥ 1 can thus be considered as an overhead factor of a particular MCMC chain, compared to an ideal independent chain, asymptotically in the length of the chain.
For discrete state spaces, the result in Lemma 2 follows directly from [31, Eqn. (2.1)]; one could argue that this is sufficient for practical computation since computers are finite dimensional.
The TT cross method tends to introduce a more or less uniform error of magnitude ε on average. In regions where π(x) ≫ ε, this leads to a bounded importance ratio w(x) ≤ 1 + O(ε). When π(x) ≪ ε, we will typically have π*(x) = O(ε), and hence w(x) < 1. However, if a negative error of order ε is committed at a point where π(x) ≈ ε, the two may cancel, resulting in a small π*(x) and consequently a large w(x). Numerical experiments demonstrate that w* − 1 can indeed be much larger than the L1-norm error used in Lemma 1. However, these cancellations (and hence equality in the bound min(1/w(x), 1/w(y)) ≥ 1/w*) seem to be rare. The practical IACT for coordinates (g = x) tends to be much smaller than the upper bound given by Lemma 2.
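An empirical IACT of the kind reported in the experiments can be estimated from a chain as follows; the truncation at the first negative autocorrelation estimate is a common heuristic of our choosing (windowed estimators are more robust for long chains), not the paper's specific estimator.

```python
import numpy as np

def iact(g):
    """Estimate tau = 1 + 2 sum_t rho(t) from one chain of the statistic g,
    truncating the sum at the first non-positive autocorrelation."""
    g = np.asarray(g, dtype=float) - np.mean(g)
    var = np.dot(g, g) / len(g)
    tau = 1.0
    for t in range(1, len(g) // 2):
        rho = np.dot(g[:-t], g[t:]) / (len(g) * var)   # lag-t autocorrelation
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau
```

For an i.i.d. chain the estimate is close to 1, the ideal overhead factor; for an AR(1) chain with coefficient φ it approaches (1 + φ)/(1 − φ).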

Improved Quadrature Points
Standard MCMC substitutes samples from the chain induced by an MH sampler, such as the TT-MH sampler, into a Monte Carlo quadrature to estimate expectations of statistics of interest. One downside of this MCMC approach is its slow convergence, with the quadrature error being of order N^{−1/2} for N samples.
It is tempting to use more structured quadrature points to get a better convergence rate. For example, the TT representation of π suggests the possibility of using that representation to reduce the required multivariate integration to a sequence of univariate integrals, as we did when forming the marginal distributions in Sec. 4.1. Another option is to note that the TT-CD map is also well defined for arbitrary seed points, such as those taken from a quasi-Monte Carlo (QMC) lattice [33,19]. Under certain assumptions on the smoothness of the quantity of interest, the QMC quadrature can give an error that reduces with order N^{−1} [5]. However, the deterministic origin of the QMC lattice induces two caveats: first, the QMC estimate is biased, and second, the MH accept/reject 'correction' step is not available, as the acceptance probability would be exactly 0 (since the effective QMC proposal distribution is not absolutely continuous with respect to the continuous target π) [50].
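A rank-1 lattice of the kind referenced above can be sketched in a few lines; the generating vector in the usage below is a placeholder, not an optimized component-by-component construction.

```python
import numpy as np

def rank1_lattice(N, z, shift=None):
    """Rank-1 lattice rule: N points q_l = frac(l * z / N) in [0,1]^d for a
    generating vector z; an optional random shift restores unbiasedness."""
    pts = (np.arange(N)[:, None] * np.asarray(z)[None, :] / N) % 1.0
    if shift is not None:
        pts = (pts + shift) % 1.0
    return pts
```

When gcd(z_i, N) = 1, each coordinate of the lattice visits the equispaced values {0, 1/N, …, (N−1)/N} exactly once, which underlies the improved uniformity of the seed points.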
In the following two sections, we develop methods for using QMC quadrature points to produce unbiased estimates. First in Sec. 4.3.1, the QMC quadrature is used as a kind of control variate, and then in Sec. 4.3.2 to estimate each term in the ratio defining the importance weights.
We will use C to denote the Rosenblatt transform associated with the approximate distribution π*; hence the TT-CD Alg. 2 implements the mapping C^{−1} on seed points. We write the QMC points for quadrature with respect to π* as C^{−1}q_ℓ, where {q_ℓ}_{ℓ=1}^{N} are taken from a QMC lattice in [0, 1]^d.

Quasi-Monte Carlo Quadrature as a Control Variate (TT-qCV)
As noted above, implementing quasi-Monte Carlo quadrature by seeding the TT-CD algorithm with QMC points in [0, 1]^d leads to improved convergence rates, but at the cost of the resulting estimates being biased. In this section, we use QMC quadrature as a kind of control variate and evaluate the bias correction using a coupled TT-MH scheme, to give an unbiased estimate with reduced variance. We construct the unbiased estimate using a two-level version of the multilevel construction in [6]. Suppose we would like to compute an expectation E_π g of some statistic g(x) w.r.t. the target probability density π(x). Considering also the expectation w.r.t. an approximate density π*(x), we can write the identity E_π g = (E_π g − E_{π*} g) + E_{π*} g. (8) We evaluate the two terms by different quadratures, as follows.
The last term in (8) is approximated by QMC quadrature, using QMC points mapped via the TT-CD algorithm, E_{π*} g ≈ (1/N_0) Σ_{ℓ=1}^{N_0} g(x_ℓ^0), (9) where x_ℓ^0 = C^{−1} q_ℓ^0 according to Alg. 2, with {q_ℓ^0} being the QMC points on [0, 1]^d. The first, parenthesized term in (8) is estimated using TT-MH, using the same uniformly-at-random seed points mapped via TT-CD, which are then corrected by the MH step with target π and π*, respectively, E_π g − E_{π*} g ≈ (1/N_1) Σ_{ℓ=1}^{N_1} [g(x_ℓ^π) − g(x_ℓ^{π*})]. (10) Since the MH step for π* does not reject the π*-distributed points, the x_ℓ^{π*} are just the TT-CD mapped seed points. However, the x_ℓ^π are subject to the accept/reject step w.r.t. π acting on x_ℓ^{π*}. Each term in (10) is thus either zero, if x_ℓ^{π*} is accepted, or g(x_{ℓ−t}^{π*}) − g(x_ℓ^{π*}), with x_{ℓ−t}^{π*} the last accepted point, if a rejection occurs.
Since (9) and (10) converge to the exact expectations as N_0, N_1 → ∞, the total sum converges to E_π g. However, the variance of the correction (10) can be much smaller than the variance of the expectation (9), and hence N_1 can be chosen smaller than N_0. Similarly to Lemma 2, the variance of (10) over experiments can be estimated as O(ε|g|^2/N_1), where ε is the expected rejection rate. On the other hand, the QMC rule provides a variance of (9) of the order of |g|^2/N_0^2. For a desired relative quadrature error E_q we can thus take N_0 = O(E_q^{−1}) and N_1 = O(ε E_q^{−2}). Notice that the PDF accuracy ε serves as a variance reduction factor. The particular values of N_0 and N_1 can be determined by an adaptive greedy procedure [26], which compares empirical variances and costs of the two levels and doubles N in the level that has the maximum profit.
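The two-level split (8)-(10) can be sketched numerically as follows. This is a toy illustration with 1D densities of our own choosing; in particular, the coupling below uses common random numbers in place of the paper's MH rejection mechanism, and the "QMC" level uses simple midpoint seeds.

```python
import numpy as np

def two_level(g, qmc_points_pistar, coupled_draws, N0, N1):
    """Unbiased two-level estimate of E_pi[g]:
    level 0: quadrature (9), QMC-style points distributed as pi*;
    level 1: quadrature (10), a coupled correction for the pi*-vs-pi bias."""
    xq = qmc_points_pistar(N0)
    level0 = np.mean(g(xq))                       # estimates E_pi*[g]
    x_pi, x_pistar = coupled_draws(N1)            # coupled draws from pi, pi*
    correction = np.mean(g(x_pi) - g(x_pistar))   # estimates E_pi[g] - E_pi*[g]
    return level0 + correction
```

Because the coupled draws are strongly correlated, the correction term has small variance, so N_1 can be much smaller than a direct Monte Carlo estimate of E_π g would require.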

Quasi-Monte Carlo within Importance Weighting (TT-qIW)
Since the QMC points allow quadrature with respect to the approximate distribution π*, the bias in estimates may also be removed by importance re-weighting. Writing the expectation as an integral, then multiplying and dividing by the approximate density function, gives E_π g = (1/Z) ∫ g(x) [π(x)/π*(x)] π*(x) dx. That is, the expected value of g with respect to π equals the expected value of the importance-weighted function g(x)π(x)/π*(x) with respect to the approximate density π*. Here, Z is the normalization constant for the target density π, which accommodates the possibility that π(x) is only available up to an unknown normalization constant, as is common in Bayesian hierarchical analyses. Transformed samples {x_ℓ}_{ℓ=1}^{N} serve as quadrature points with respect to the density π*. Therefore, we can approximate E_π g ≈ [Σ_{ℓ=1}^{N} g(x_ℓ) π(x_ℓ)/π*(x_ℓ)] / [Σ_{ℓ=1}^{N} π(x_ℓ)/π*(x_ℓ)]. (11) Note that, since the x_ℓ are π*-distributed, the denominator π*(x_ℓ) is positive with probability 1, and hence the importance quadrature (11) is well defined. The convergence of (11) depends on the distance between π* and π and on the intrinsic properties of the samples x_ℓ. For example, if the x_ℓ are produced from a QMC lattice, and the weighted statistic g(x)π(x)/π*(x) is smooth, we can expect a rate of convergence faster than N^{−1/2}.
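The self-normalized quadrature (11) can be sketched as follows; the toy densities in the usage are our own, and the constant added to the log-target demonstrates that the unknown normalization Z cancels in the ratio.

```python
import numpy as np

def importance_quadrature(g, x, logpi, logpistar):
    """Self-normalized importance-weighted estimate (11) of E_pi[g] from
    points x distributed as pi*; weights are stabilized in log space and
    any normalization constant cancels in the ratio."""
    logw = logpi(x) - logpistar(x)
    w = np.exp(logw - logw.max())
    return np.sum(w * g(x)) / np.sum(w)
```

The same formula applies whether x comes from random sampling or from TT-CD-mapped QMC points; only the convergence rate differs.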

Shock absorber reliability
In this section, we demonstrate our algorithm on a problem of reliability estimation of a shock absorber. The time to failure of a type of shock absorber depends on some environmental conditions (covariates) such as humidity, temperature, etc. We use data [35] on the distance (in kilometers) to failure for 38 vehicle shock absorbers. Since no values of any covariates were recorded in this example, the values of D covariates were synthetically generated from the standard normal distribution, as this corresponds to the case in which the covariates have been standardized to have mean zero and variance equal to one. The accelerated failure time regression model [29] is widely used for reliability estimation with covariates. We use an accelerated failure time Weibull regression model, which was described as reasonable for this data in [29], where the density of the time to failure is of the form f(t | θ_1, θ_2) = (θ_2/θ_1)(t/θ_1)^{θ_2−1} exp(−(t/θ_1)^{θ_2}), and where θ_1, θ_2 are unknown scale and shape hyperparameters, respectively. The covariates are assumed to affect the failure time distribution only through the scale parameter θ_1, via a standard logarithmic link function, that is, log θ_1 = β_0 + Σ_{k=1}^{D} β_k x_k, where the x_k are the covariates. The D + 2 unknown parameters β_0, …, β_D and θ_2 must be inferred from the observation data on the covariates x_k and the failure times, which in this example are subject to right censoring (censored distances marked with +). The set T_f of failure times is taken from the data in [35]. To perform Bayesian inference on the unknown parameters, we use the prior specifications in [20], namely a Normal-Gamma distribution over the parameters (β_0, …, β_D, θ_2) with γ = 2.2932, α = 6.8757, m_0 = log(30796), σ_0^2 = 0.1563, m_1 = ⋯ = m_D = 0, σ_1 = ⋯ = σ_D = 1, and normalization constant Z. The parameter ranges are chosen large enough to treat the probability outside as negligible.
The (unnormalized) Bayesian posterior density function is given by the product of Weibull probabilities, evaluated at each observation in T_f, and the prior distribution.
The formula for the censored case arises from the fact that the contribution of a censored measurement is the probability that t exceeds the measured value, that is, P(t > t_obs | θ_1, θ_2) = exp(−(t_obs/θ_1)^{θ_2}). We introduce n uniform discretization points in each of β_0, …, β_D and θ_2 and compute the TT cross approximation of the discretized density π(β_0, …, β_D, θ_2).
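The censored Weibull likelihood described above can be sketched as follows. The function names, argument layout, and any test data are assumptions for illustration, not the paper's implementation or its dataset; uncensored observations contribute the log-density, censored ones the log-survival probability, and the covariates enter through the log link for θ_1.

```python
import numpy as np

def log_post_terms(t, censored, X, beta, theta2):
    """Unnormalized log-likelihood of the Weibull accelerated failure time
    model: log theta1 = beta[0] + X @ beta[1:]; censored observations
    contribute log S(t) = -(t/theta1)**theta2, uncensored ones log f(t)."""
    th1 = np.exp(beta[0] + X @ beta[1:])            # scale via log link
    z = (t / th1) ** theta2
    log_f = np.log(theta2 / th1) + (theta2 - 1) * np.log(t / th1) - z
    log_S = -z                                      # log survival function
    return np.where(censored, log_S, log_f).sum()
```

Adding the log-prior of the Normal-Gamma specification to this quantity gives the unnormalized log-posterior that the TT cross algorithm discretizes.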
We consider two quantities of interest, the right 95% mean quantile and the right 95% quantile of the mean distribution (12), respectively. The nonlinear constraint in the computation of the second quantile is solved by Newton's method. To estimate the quadrature error, we perform 32 runs of each experiment and compute an average relative error (13) over all runs, where ι and ℓ enumerate different runs.

Accuracy of the TT approximation and the CD sampler
We start by analysing the TT-MH sampling procedure, as described in Section 4.2. First, we consider how the errors in π̂ due to the tensor approximation and the discretization propagate into the quality of the MCMC chain produced by the MH algorithm, i.e., the rate of rejections and the integrated autocorrelation time. The chain length is always set to N = 2^20, and the results are averaged over 32 runs. We choose a relatively low dimensionality, D = 2, since it allows us to approximate π up to high accuracy. In Fig. 1, we vary the number of grid points n, fixing the stopping tolerance for the TT cross algorithm at δ = 10^{−5}, as well as benchmarking the algorithm for different thresholds δ, fixing n = 512. We track the relative empirical standard deviation of the TT approximation, which can be computed exactly in the TT representation, a qIW approximation to the L1-norm error E_{L1} used in Lemma 1, and the essential supremum w* of the importance ratio, taken with respect to π*. As shown in Lemma 1, the rejection rate is expected to be proportional to the approximation error in the L1 norm, as this error goes to zero. The TT approximation is computed on a tensor product grid with piecewise linear interpolation, whose error decays as O(n^{−2}), provided π is sufficiently smooth. We can see in Fig. 1 (top-left) that the rejection rate converges with O(n^{−2}), suggesting that this is the case here. The bottom-left of Fig. 1 also suggests that the rejection rate is proportional to the TT approximation error when it is greater than the interpolation error. The behaviour of the importance ratio and the autocorrelation time is more complicated. The TT cross algorithm tries to reduce the average approximation error; the pointwise relative error, however, is not guaranteed to be bounded. Although the maximal importance weight does decay to 1 as δ → 0, it still remains orders of magnitude larger than E_{L1}.
However, Lemma 2 appears to give an overly pessimistic estimate for the IACT: the actual value $\tau - 1$ is much smaller than $w^*$ and behaves similarly to the rejection rate.
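The effect of an imperfect surrogate on such an independence MH chain can be reproduced in a toy setting (a hedged sketch: a one-dimensional Gaussian target with a deliberately mis-scaled Gaussian proposal stands in for $\pi$ and its TT surrogate; the closer the proposal scale is to 1, the smaller the rejection rate, mirroring the $L_1$-error dependence above):

```python
import numpy as np

def independence_mh(logpi, sample_prop, logprop, n_steps, rng):
    """Independence Metropolis-Hastings: proposals are i.i.d. draws from the
    surrogate; acceptance uses the ratio of importance weights w = pi/prop."""
    theta = sample_prop(rng)
    logw = logpi(theta) - logprop(theta)      # log importance weight of state
    chain, rejects = [theta], 0
    for _ in range(n_steps):
        prop = sample_prop(rng)
        logw_prop = logpi(prop) - logprop(prop)
        if np.log(rng.uniform()) < logw_prop - logw:
            theta, logw = prop, logw_prop     # accept
        else:
            rejects += 1                      # reject: repeat current state
        chain.append(theta)
    return np.array(chain), rejects / n_steps

rng = np.random.default_rng(0)
# target: standard normal; surrogate: slightly wrong scale (the "TT error")
logpi = lambda x: -0.5 * x**2
sigma = 1.1
logprop = lambda x: -0.5 * (x / sigma)**2 - np.log(sigma)
chain, rej_rate = independence_mh(logpi, lambda r: sigma * r.standard_normal(),
                                  logprop, 2**15, rng)
print(rej_rate)   # small, and shrinks further as sigma -> 1
```

The rejection rate shrinks as the proposal approaches the target, consistent with the linear error-to-rejection-rate relationship observed in Fig. 1.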
The complexity of the TT cross algorithm (in terms of both the number of evaluations of $\pi$ and the computational time) grows only very mildly (sublinearly) with $1/\delta$ and $n$ (notice the log-polynomial scale in Fig. 1, right). This makes the TT approach scalable to high accuracies as well.

Convergence of the quantity of interest and comparison to DRAM
Now we investigate the convergence of the quantiles and compare TT-MH with the delayed rejection adaptive Metropolis (DRAM) algorithm [21]. The initial covariance for DRAM is chosen to be the identity matrix. In order to eliminate the effect of the burn-in period, we do not include the first N/4 elements of the DRAM chain in the computation of the quantiles. However, we will study the actual burn-in time empirically to have a fairer comparison of the "set-up cost" of the two methods.
First, in Table 1, we fix $D = 6$ covariates and vary the discretization grid $n$ and the TT approximation threshold $\delta$. We present the rejection rates and the IACTs for TT-MH, with $n = 12$, $16$, and $32$ grid points in each direction, using values of $\delta = 0.5$ and $\delta = 0.05$, as well as for DRAM. In addition, we also give the set-up cost in terms of the number of evaluations of $\pi$, i.e. the number of points needed to construct the TT approximation via the TT cross algorithm for TT-MH, and the burn-in in DRAM. The latter is estimated as the point of stabilization of 6 moments of $\beta$ and $\theta_2$, approximated by averaging over $2^{14}$ random initial guesses. The coarsest TT approximation requires about $4 \cdot 10^4$ evaluations, whereas DRAM needs a burn-in of about $5 \cdot 10^4$ steps.
Next, in Fig. 2 (left), we show the estimate $E_q$ of the quadrature error defined in (13) for the two quantities of interest in (12), versus the total number $N$ of samples in the MCMC chain, which is varied from $2^{10}$ to $2^{23}$. We see that both MH methods (i.e. TT-MH and DRAM) converge with a rate of $N^{-1/2}$, as expected. To keep the set-up cost of the TT approximation low, we only consider fairly crude TT approximations (as in Tab. 1). However, all our approximations deliver a smaller sampling error for TT-MH than for DRAM when measured against the number of samples, and an even greater reduction when plotted against CPU time (Fig. 2, right). More accurate TT approximations require more evaluations of $\pi$ during the set-up in TT cross, up to $2.5 \cdot 10^5$ for the finest approximation in Fig. 2 (right), which exceeds the burn-in cost of DRAM. However, TT-MH is much faster than DRAM for the same number of evaluations, which yields a significant difference in terms of the total CPU time.
There are several reasons for this. For higher TT accuracies, the gains are mainly due to the significantly lower IACT of TT-MH, leading to a much better statistical efficiency of the MCMC chain. For low TT accuracies, the IACT of TT-MH is still half of that for DRAM, and in addition there is some gain due to the reduced set-up cost. A further reason is the vectorization that is exploited in TT cross, where a block of $O(nr^2)$ samples is evaluated in each step. In DRAM, the function needs to be evaluated point by point in order to perform the rejection step. Therefore, the number of distinct calls to $\pi$ in TT cross is much smaller than $N$, reducing the corresponding overhead in Matlab. In compiled languages (C, Fortran) on a single CPU, the difference may be less significant. However, parallel implementations will also benefit from the blocking, especially when each sample is expensive. If a high accuracy is needed for the estimation of the expected value, it is worthwhile to compute a more accurate TT approximation, since in that case the length of the MCMC chain will dominate the number of samples in the set-up phase.
In Fig. 2, we also present results for the TT-qIW approach described in Sec. 4.3.2, where the approximate density $\pi^*$ is used as an importance weight and where the expected value and the normalizing constant are estimated via QMC quadrature. In particular, we use a randomized rank-1 lattice rule with product weight parameters $\gamma_k = 1/k^2$. The generating vector used is available from Frances Kuo's website, namely the file lattice-39102-1024-1048576.3600 at http://web.maths.unsw.edu.au/~fkuo/. Due to the non-smooth dependence of the quantiles on the covariates, the rate of convergence of TT-qIW with respect to $N$ is not improved in this example, but in absolute terms it consistently outperforms TT-MH, leading to even bigger gains over DRAM.
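A randomly shifted rank-1 lattice rule of the kind used here can be sketched as follows (a hedged illustration: the short generating vector below is a toy stand-in for the one from Kuo's file, and the test integrand is chosen only because its exact integral is known):

```python
import numpy as np

def randomized_lattice(z, N, n_shifts, rng):
    """Randomly shifted rank-1 lattice rule: points frac(i*z/N + shift) for
    i = 0..N-1, replicated over n_shifts independent uniform random shifts."""
    i = np.arange(N)[:, None]                 # (N, 1)
    base = (i * z[None, :] % N) / N           # (N, d) deterministic lattice
    shifts = rng.uniform(size=(n_shifts, 1, z.size))
    return (base[None, :, :] + shifts) % 1.0  # (n_shifts, N, d)

rng = np.random.default_rng(42)
z = np.array([1, 182667, 469891])   # toy generating vector (illustrative only)
pts = randomized_lattice(z, N=1024, n_shifts=8, rng=rng)

# QMC estimate of a smooth product integrand with exact integral 1,
# matching the product-weight structure gamma_k = 1/k^2
f = lambda x: np.prod(1 + (x - 0.5) / np.arange(1, x.shape[-1] + 1)**2, axis=-1)
est = f(pts).mean(axis=1)       # one estimate per random shift
print(est.mean(), est.std())    # mean ~ 1; std over shifts gives an error bar
```

Averaging over independent random shifts keeps the estimator unbiased while the spread over shifts provides a practical error estimate.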
Finally, we fix the TT and MCMC parameters to $n = 16$, $\delta = 0.05$ and $N = 2^{22}$ and vary the number of covariates $D$, and hence the total dimensionality $d = D + 2$. In Fig. 3, we show the error in the quantiles, the number of evaluations of $\pi$, as well as the autocorrelation times. We see that the TT-MH approach remains more efficient than DRAM over a wide range of dimensions.

Rosenbrock function
As a benchmark example with particularly long tails (and hence potentially large autocorrelation times in MCMC), we consider the PDF induced by the Rosenbrock function; the dimension $d$ can be increased arbitrarily. The parameters for the TT approximation are chosen to be $\delta = 3 \cdot 10^{-3}$ and $n = 128$ for $\theta_1, \ldots, \theta_{d-2}$, $n = 512$ for $\theta_{d-1}$ and $n = 4096$ for $\theta_d$. Each $\theta_k$ is restricted to a finite interval $[-a_k, a_k]$, where $a_d = 200$, $a_{d-1} = 7$ and $a_k = 2$ otherwise. Fig. 4 shows certain projections of $N = 2^{17}$ sampling points produced with TT-MH and DRAM for $d = 32$. We see that although the density function is reasonably compact and isotropic in the first variables, it is highly concentrated in the last variable. DRAM requires a significant number of burn-in iterations, which can be seen in Fig. 4 (middle and right) as the red cloud of samples that are not overlapped by blue ones. The difference is even more significant if we look at the integrated autocorrelation times in Tab. 2. In order to eliminate the burn-in in DRAM, we compute $2^{20}$ samples and again discard the first quarter of the chain. We see that the IACT of TT-MH stays close to 1 for all considered dimensions, while it exceeds 100 for DRAM for larger $d$.
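For reference, one common chained variant of the Rosenbrock log-density can be written down as follows (a hedged sketch: the paper's exact scaling factors and the treatment of the last coordinates may differ):

```python
import numpy as np

def log_rosenbrock(theta, scale=1.0):
    """Unnormalized log-density induced by a chained Rosenbrock function:
    log pi(theta) = -sum_k [ theta_k^2 + scale*(theta_{k+1} - theta_k^2)^2 ].
    One common variant; the paper's exact form may differ."""
    theta = np.asarray(theta, dtype=float)
    t, tnext = theta[..., :-1], theta[..., 1:]
    return -np.sum(t**2 + scale * (tnext - t**2)**2, axis=-1)

# the density concentrates along the chained parabola theta_{k+1} ~ theta_k^2,
# producing the long curved tails that slow down random-walk MCMC
on_ridge = log_rosenbrock([1.0, 1.0, 1.0])    # lies on the parabola
off_ridge = log_rosenbrock([1.0, 1.0, 5.0])   # far from it, heavily penalized
print(on_ridge, off_ridge)
```

Because each coordinate feeds into the next, the effective support stretches out dramatically in the later coordinates, which is why the grid sizes $n$ and intervals $[-a_k, a_k]$ above grow towards $\theta_d$.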

Inverse diffusion problem
Finally, we use our new TT-CD sampler to explore the posterior distribution arising from a Bayesian formulation of an infinite-dimensional inverse problem, as formalized in [48].
Let $X$ and $V$ be two infinite-dimensional function spaces (it is sufficient to consider separable Banach spaces) and let $G : X \to V$ be a (measurable and well-posed) forward map. Consider the inverse problem of finding $\kappa \in X$, an input to $G$, given some noisy observations $y \in \mathbb{R}^{m_0}$ of some functionals of the output $u \in V$. In particular, we assume a (measurable) observation operator $Q : V \to \mathbb{R}^{m_0}$ and observations $y = Q(G(\kappa)) + \eta$, where $\eta \in \mathbb{R}^{m_0}$ is a mean-zero random variable that represents the observational noise. The inverse problem is clearly under-determined when $m_0 \ll \dim(X)$, and in most mathematical models the inverse of the map $G$ is ill-posed.
We do not consider prior modelling in any detail, and present here a stylized Bayesian formulation designed to highlight the computational structure and cost. We simply state a prior measure $\mu_0$ to model $\kappa$ in the absence of observations $y$. The posterior distribution $\mu^y$ over $\kappa|y$, the unknown coefficients conditioned on observed data, is given by Bayes' theorem for general measure spaces, $\frac{d\mu^y}{d\mu_0}(\kappa) = L(y|\kappa)/Z$, where the left-hand side is the Radon-Nikodym derivative, $L$ is the likelihood function, and $Z$ is the normalizing constant [48]. For computing, we have to work with a finite-dimensional approximation $\kappa_d \in X_d \subset X$ of the latent field $\kappa$ such that $\dim(X_d) = d \in \mathbb{N}$, and define $\kappa_d$ as a deterministic function of a $d$-dimensional parameter $\theta := (\theta_1, \ldots, \theta_d)$. Typically, we require that $\kappa_d \to \kappa$ as $d \to \infty$, but we will not focus on that convergence here and instead fix $d \gg 1$.
To be able to apply the TT representation, we set $\theta_k \in [a_k, b_k]$ with $a_k < b_k$, for all $k = 1, \ldots, d$, so that $\kappa_d$ maps the tensor-product domain $\Gamma_d := [a_1, b_1] \times \cdots \times [a_d, b_d]$ into $X_d$. We denote by $\pi_0(\theta)$ and $\pi(\theta) = \pi^y(\theta)$ the probability density functions of the pull-back measures of the prior and posterior measures $\mu_0$ and $\mu^y$ under the map $\kappa_d : \Gamma_d \to X_d$, respectively, and specify that map so that $\pi_0(\theta) = 1/|\Gamma_d|$, i.e. the prior distribution over $\theta$ is uniform.
We can then compute TT approximations of the posterior density $\pi(\theta)$ as in the previous examples by using Bayes' formula (17), i.e. $\pi(\theta) \propto L(y|\kappa_d(\theta))\, \pi_0(\theta)$.
Consider some quantity of interest in the form of another functional $F : V \to \mathbb{R}$ of the model output $G(\kappa_d)$. The posterior expectation of $F$, conditioned on the measured $y$, can then be computed as $\mathbb{E}_\pi[F] = \int_{\Gamma_d} F(G(\kappa_d(\theta)))\, \pi(\theta)\, d\theta$.

Stylized elliptic problem and parametrization
As an example, we consider the forward map defined by the stochastic diffusion equation $-\nabla \cdot (\kappa_d \nabla u) = 0$ on $D = [0,1]^2$, with Dirichlet boundary conditions $u|_{x_1=0} = 1$, $u|_{x_1=1} = 0$, and Neumann (zero flux) conditions otherwise [45]. Simulating the forward map requires solving this partial differential equation (PDE) (19), which depends on the unknown diffusion coefficient $\kappa_d \in X_d \subset L^\infty(D)$, parameterized as above.
For this example, we take each of the parameters $\theta_k$, $k = 1, \ldots, d$, to be uniformly distributed on $[-\sqrt{3}, \sqrt{3}]$. Then, for any $\theta \in \Gamma_d$ and $x = (x_1, x_2) \in D$, the diffusion coefficient at $x$ is defined by the series expansion (20) of $\ln \kappa_d(\theta, x)$. The expansion is similar to the one proposed in [9], and mimics the asymptotic behaviour of the Karhunen-Loève expansion of random fields with Matérn covariance function and smoothness parameter $\nu$ in two dimensions, in that the norms of the individual terms decay algebraically with the same rate. However, realizations do not have the same qualitative features and we use it purely to demonstrate the computational efficiency of our new TT samplers.
To discretize the PDE in (19), we tessellate the spatial domain $D$ with a uniform rectangular grid $T_h$ with mesh size $h$. Then, we approximate the exact solution $u \in V := H^1(D)$ that satisfies the Dirichlet boundary conditions with the continuous, piecewise bilinear finite element (FE) approximation $u_h \in V_h$ associated with $T_h$. To find $u_h$ we solve the resulting Galerkin system using a sparse direct solver.
For this example, we take the observations to be $m_0$ noisy local averages of the PDE solution over some subsets $D_i \subset D$, $i = 1, \ldots, m_0$. We take the observation noise to be additive, distributed as i.i.d. zero-mean Gaussian noise with variance $\sigma_n^2$, which gives the likelihood function and, via Bayes' formula, the posterior distribution. In our experiments, the sets $D_i$ are square domains with side length $2/(\sqrt{m_0} + 1)$, centred at the interior vertices of a uniform Cartesian grid on $D = [0, 1]^2$ with grid size $1/(\sqrt{m_0} + 1)$, that form an overlapping partition of $D$. We consider an academic problem with synthetic data for these $m_0$ local averages from some "true" value $\theta^*$. In particular, we evaluate the observation operator at $\theta^* = (\theta_0, \theta_0, \ldots, \theta_0)$, for some fixed $\theta_0 \neq 0$ with $\theta_0 \in (-\sqrt{3}, \sqrt{3})$, and synthesize data by then adding independent normally distributed noise $\eta^* \sim N(0, \sigma_n^2 I)$, such that $y = Q(G(\theta^*)) + \eta^*$.
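The data-synthesis step can be sketched as follows (a hedged toy version: `observe` stands in for the PDE-based observation operator $Q(G(\cdot))$, and all numerical values are illustrative, not the paper's defaults):

```python
import numpy as np

rng = np.random.default_rng(1)
m0, sigma_n = 9, 0.1              # number of observations, noise std (illustrative)
d, theta0 = 11, 0.5               # dimension and "true" constant parameter value
theta_star = np.full(d, theta0)   # theta* = (theta0, theta0, ..., theta0)

def observe(theta):
    """Stand-in for Q(G(theta)): in the paper these are local averages of the
    PDE solution; here a smooth toy map keeps the sketch self-contained."""
    return np.array([np.cos((i + 1) * theta).mean() for i in range(m0)])

# synthesize data: y = Q(G(theta*)) + eta*, with eta* ~ N(0, sigma_n^2 I)
y = observe(theta_star) + sigma_n * rng.standard_normal(m0)

def log_likelihood(theta):
    """Gaussian log-likelihood corresponding to the additive noise model."""
    misfit = observe(theta) - y
    return -0.5 * misfit @ misfit / sigma_n**2

print(log_likelihood(theta_star))   # close to -m0/2 on average
```

The unnormalized posterior density is then the likelihood times the uniform prior density, which is the function passed to the TT cross algorithm.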
We consider two quantities of interest. The first is the average flux at $x_1 = 1$. This can be computed as [49] $F(G(\theta)) = -\int_D \kappa_d(\theta, x)\, \nabla u_h \cdot \nabla w_h \, dx$, where $w_h \in V_h$ is any FE function that satisfies the Dirichlet conditions at $x_1 = 0$ and $x_1 = 1$. This formula for the average flux is a smooth function of $\theta$, which ensures fast convergence of QMC-based quadrature rules, with an order close to $N^{-1}$. However, we also consider the discontinuous indicator function $\mathbb{I}_{F(\theta) > 1.5}$, to estimate the probability that the average flux in (21) becomes larger than 1.5, i.e., $P_{F>1.5} = \mathrm{Prob}(F(G(\theta)) > 1.5) = \mathbb{E}_\pi[\mathbb{I}_{F(\theta)>1.5}]$.
As we shall see, the non-smoothness of $\mathbb{I}_{F(\theta)>1.5}$ reduces the order of convergence of the QMC quadrature to $N^{-1/2}$. For the same reason, this function lacks a low-rank TT decomposition, and hence we cannot compute its expectation using a tensor product quadrature directly. The mean field flux is $F|_{\theta=0} = 1$ (in the units used), and the probability $P_{F>1.5}$ is of the order of 0.1.
The default parameters used in the stochastic model and for function approximation are shown in Table 3. We will make it clear when we change any of those default parameters.

Set-up cost and accuracy of the TT approximation
The TT approximation $\pi^*$ can be computed directly by the TT cross algorithm, as in the previous examples. For a TT tolerance of $\delta = 0.1$, this requires about $10^4$ to $10^5$ evaluations of $\pi$. However, since here the computation of each value of $\pi(\theta)$ involves the numerical solution of the PDE (19), this leads to a significant set-up time. This set-up time can be hugely reduced by first building a TT approximation $\tilde u_h(\cdot, \theta)$ of the FE solution $u_h(\cdot, \theta)$ and then using $\tilde u_h(\cdot, \theta)$ instead of $u_h(\cdot, \theta)$ in the TT cross algorithm for building $\tilde\pi$. It was shown in [8] that a highly accurate approximation of $u_h(\cdot, \theta)$ in the TT format can be computed using a variant of the TT cross algorithm, the alternating least-squares cross (ALS-cross) algorithm, that only requires $O(r)$ PDE solves, if the TT ranks to approximate $u_h(\cdot, \theta)$ up to the discretization error are bounded by $r$. Moreover, the rank grows only logarithmically with the required accuracy. We will see below that $r < 100$ for this model problem with $h = 2^{-6}$, significantly reducing the number of PDE solves required in the set-up phase.
Since the observation operator $Q$ consists of integrals of the PDE solution over subdomains of the spatial domain $D$, when applied to a function given in TT format it can be evaluated at a cost that is smaller than $r$ PDE solves on $T_h$, without any increase in the TT rank [8]. Finally, to compute an approximation of $\pi$ via the TT cross algorithm, we use the significantly cheaper TT surrogate $Q(\tilde u_h(\cdot, \theta))$ in each evaluation of $\pi(\theta)$, instead of computing the actual FE solution $u_h(\cdot, \theta)$. Since $\tilde u_h(\cdot, \theta)$ is accurate up to the FE discretization error in $V_h$, which for this model problem with $h = 2^{-6}$ is of $O(10^{-4})$, this has essentially no impact on the accuracy of the resulting TT approximation $\tilde\pi$ (especially for $\delta = 0.1$).
As in the shock absorber example, we test how the quality of the Markov chain produced by TT-MH depends on the error between $\tilde\pi$ and $\pi$. In Figure 5 (left), we show the rejection rates, the IACT and the error estimates (14), (15) for different stopping tolerances $\delta$ and grid sizes $n$. In the top plot, we fix $\delta = 10^{-3}$ and vary $n$, while in the bottom plot, $n$ is fixed at 512 and $\delta$ is varied. The other model parameters are set according to Table 3, and the chain length is $N = 2^{16}$. The behaviour is as in the shock absorber example and as predicted by Lemma 1.
In Fig. 5 (right), we see that the TT ranks needed to approximate the FE solution $u_h(\cdot, \theta)$, and thus also the misfit functional $Q(u_h(\cdot, \theta)) - y$, are significantly smaller than the TT ranks needed to approximate the density function $\pi(\theta)$ to the same accuracy. In both cases, the TT ranks show only a logarithmic dependence on $\delta$, as stated above, and they are independent of $n$, for $n$ sufficiently large. For the default parameters in Table 3, the ranks are 26 and 82, respectively, and the number of PDE solves to build the TT surrogates is about 100 and about 53000, respectively.

Convergence of the expected quantities of interest
In this section, we investigate the convergence of estimates of the expected value of the quantities of interest, and the computational complexity of the different methods. For the TT approximation of the density function $\pi$, we fix $n = 32$ and $\delta = 0.1$. For the TT approximation of $u_h$, we choose a TT tolerance that is equal to the discretization error, which for $h = 2^{-6}$ is about $10^{-4}$.
To compute the posterior expectations of the QoIs in (18), we compare three approaches that use our TT-CD sampling procedure:
[TT-MH] (Sec. 4.2) Using a Metropolis-Hastings algorithm with independence proposals, sampled via the TT-CD sampling procedure from the approximate distribution $\pi^*$.
[TT-qCV] (Sec. 4.3.1) Using QMC quadrature with respect to the approximate density $\pi^*$ in a similar way to a control variate. Bias correction uses a Metropolis-Hastings procedure targeting the correct density $\pi$, with independence proposals sampled via the TT-CD sampler.
[TT-qIW] (Sec. 4.3.2) Using the approximate density π * as an importance weight and estimating the expected value and the normalizing constant via QMC quadrature.
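The importance-weighted ratio estimator at the heart of TT-qIW can be illustrated in one dimension (a hedged sketch: the Gaussian target, the deliberately crude uniform surrogate and its inverse CDF are toy stand-ins for $\pi$, $\pi^*$ and the TT-CD transform; the normalizing constants cancel in the ratio):

```python
import numpy as np

def iw_ratio_estimate(F, logpi, logpi_star, inv_cdf_star, u):
    """Importance-weighted ratio estimator: map quadrature seeds u through the
    inverse CDF of the surrogate pi*, weight each point by w = pi/pi*, and
    form sum(w*F)/sum(w); unknown normalizing constants cancel."""
    theta = inv_cdf_star(u)
    logw = logpi(theta) - logpi_star(theta)
    w = np.exp(logw - logw.max())           # stabilize before exponentiating
    return np.sum(w * F(theta)) / np.sum(w)

# toy example: target N(0,1), crude surrogate = Uniform(-5, 5)
logpi      = lambda t: -0.5 * t**2          # unnormalized log target
logpi_star = lambda t: np.zeros_like(t)     # constant (unnormalized) surrogate
inv_cdf    = lambda u: -5.0 + 10.0 * u      # inverse CDF of the surrogate

u = (np.arange(2**12) + 0.5) / 2**12        # deterministic low-discrepancy seeds
second_moment = iw_ratio_estimate(lambda t: t**2, logpi, logpi_star, inv_cdf, u)
print(second_moment)   # ~ 1.0, the second moment of N(0,1)
```

In the paper's setting the seeds are QMC lattice points, the inverse CDF is the multivariate inverse Rosenblatt transform computed from the TT decomposition, and the accuracy of $\pi^*$ controls the variance of the weights.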
To benchmark the TT approaches, we again use DRAM with the initial covariance chosen to be the identity and discard the first $N/4$ samples. However, as a second benchmark, we also compute the posterior expectation directly by applying QMC to the two terms in the ratio estimate (QMC-rat), as defined in (18) and analysed in [45]. The QMC method in TT-qCV and TT-qIW is again the randomized rank-1 lattice rule with product weights $\gamma_k = 1/k^2$ and generating vector from the file lattice-39102-1024-1048576.3600 at http://web.maths.unsw.edu.au/~fkuo/. In the TT-qCV approach, the numbers $N_0$ and $N_1$ of samples for the two parts of the estimator are chosen adaptively, as in [15, 26], to optimize the computational efficiency for a given accuracy. As discussed in Section 4.3.1, for smooth QoIs we expect a relationship close to $N_1 \sim \varepsilon N_0^2$, whereas for non-smooth QoIs it will be closer to $N_1 \sim \varepsilon N_0$, where $\varepsilon$ is the accuracy of the TT approximation of $\pi$.
In order to reduce random fluctuations in the results, we average 16 runs of each approach in each experiment. The rejection rate and the IACT for TT-MH and DRAM are shown in Table 4. Notice that the autocorrelation times of DRAM for the coordinates $\theta$ and for the quantity of interest $F$ differ significantly, since the latter coordinates have a weaker influence on $F$. In Figure 6, we present the relative errors in the quantities of interest versus the chain length $N$, together with reference slopes. For the expected value $\mathbb{E}_\pi[F]$ of the flux in Fig. 6 (left), the QMC ratio estimator (QMC-rat) converges with a rate close to linear in $1/N$, so that it becomes competitive with the TT approaches for higher accuracies. However, by far the most effective approach is TT-qIW, where the TT approximation $\pi^*$ is used as an importance weight in a QMC ratio estimator. Asymptotically, the convergence rate for TT-qIW is also $O(N^{-1})$ for $\mathbb{E}_\pi[F]$, and the effectivity of the estimator is almost two orders of magnitude better than that of DRAM. All the other TT-based approaches and DRAM converge, as expected, with the standard MC order $N^{-1/2}$. For the non-smooth indicator function employed in $P_{F>1.5}$ in Fig. 6 (right), the relative performance of the different approaches is similar, although the QMC-rat estimator now also converges with the MC rate of order $O(N^{-1/2})$. Somewhat surprisingly, the TT-qIW method seems to converge slightly faster than $O(N^{-1/2})$ also for $P_{F>1.5}$ and outperforms all other approaches by an order of magnitude.
The results in Fig. 6 are all computed for the same spatial resolution of the forward model. In a practical inverse problem, for the best efficiency, all errors (due to truncation, discretization and sampling) are typically equilibrated. Thus, it is useful to estimate the spatial discretization error. We achieve this by computing the posterior expectations of the QoIs on three discretization grids (with TT-qIW and $N = 2^{18}$) and by using these to estimate the error via Runge's rule. The estimated error for $h = 2^{-6}$ is plotted as a horizontal dashed line in Fig. 6. We see that with the TT-qIW method, $N = 2^{13}$ samples are sufficient to obtain a sampling error of the order of the discretization error for $\mathbb{E}_\pi[F]$, while all other approaches require at least $N = 2^{17}$ samples (up to $N > 2^{21}$ for DRAM).
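Runge's rule, as used here, extrapolates the discretization error from QoI values on three nested grids (a minimal sketch, assuming a fixed but unknown convergence order $p$; the paper's QoI values are replaced by a synthetic second-order sequence):

```python
import numpy as np

def runge_error(q_h, q_2h, q_4h):
    """Runge's rule: from a QoI computed on grids h, 2h, 4h, estimate the
    observed convergence order p and the error of the finest value q_h:
      p   = log2( (q_4h - q_2h) / (q_2h - q_h) )
      err = |q_2h - q_h| / (2^p - 1)."""
    p = np.log2(abs((q_4h - q_2h) / (q_2h - q_h)))
    err = abs(q_2h - q_h) / (2**p - 1)
    return p, err

# illustrative values: a QoI converging at second order to the limit 1.0
q = lambda h: 1.0 + 0.3 * h**2
p, err = runge_error(q(1/64), q(1/32), q(1/16))
print(p, err)   # p ~ 2, err ~ |q(1/64) - 1.0|
```

The resulting error estimate for the finest grid is what is plotted as the horizontal dashed line in Fig. 6.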
In Fig. 7, we compare the approaches in terms of total CPU time. The horizontal offset for all the TT-based methods is the time needed to build the TT approximation $\tilde\pi$. The error then initially drops rapidly. As soon as the number $N$ of samples is big enough, the set-up cost becomes negligible and the relative performance of all the approaches is very similar to that in Fig. 6, since the computational time per sample is dominated by the PDE solve and all approaches that we are comparing evaluate $\pi$ for each sample. It is possible to significantly reduce this sampling cost if we do not evaluate the exact $\pi$ for each sample, e.g. by simply computing the expected value of the QoIs with respect to the approximate density $\pi^*$ using TT-CD and QMC quadrature. However, in that case the estimator will be biased, and the amount of bias depends on the accuracy of the TT surrogate $\pi^*$; the total cost is then dominated by the set-up cost. (A more detailed study of the cost of the various stages of our TT approach is included in Fig. 10 below.) In Fig. 8, we include a more detailed study of the influence of the TT parameters $n$ and $\delta$. As expected, a more accurate TT surrogate provides a better proposal/importance weight and thus leads to a better performance, but it also leads to a higher set-up cost. So for lower accuracies, cruder approximations are better. However, the quality of the surrogate seems to be less important for Monte Carlo based approaches. For the middle plot in Fig. 8, we used the importance weighting method described in Sec. 4.3.2 with random Monte Carlo samples (TT-rIW). The quality of the surrogate seems to be significantly more important for the QMC-based approaches, such as TT-qIW (Fig. 8, right), since the mapped QMC samples carry the PDF approximation error.
We also benchmark the algorithms in the more challenging scenario of a smaller noise variance $\sigma_n^2 = 10^{-3}$; see Fig. 9. Due to the nonlinearity of the forward model, the posterior density function is concentrated along a complicated high-dimensional manifold for smaller $\sigma_n$. This increases all complexity indicators: the ranks of the TT approximation, the IACT in TT-MH and in DRAM, and the variances in the ratio estimators. Since the density function is more concentrated, we choose finer parameters $n = 64$ and $\delta = 0.03$ for the TT approximation. Nevertheless, Fig. 9 shows that, even though the set-up cost is larger, the TT-based samplers are still all significantly more efficient than DRAM. Due to the stronger concentration of $\pi$, the performance of the basic ratio estimator QMC-rat is worse. On the other hand, the QMC estimator TT-qIW with TT importance weighting is again the best-performing method. Notice that it is the only method that reduces the quadrature error to the size of the discretization error within the considered limit of one million samples.
Finally, we profile the computational cost of all the various components of the TT approaches with respect to the total error (truncation, spatial discretization and quadrature). We vary the spatial mesh size $h$ from $2^{-5}$ to $2^{-7}$ and estimate the convergence rate of the discretization error (Fig. 10, left). Then, we choose the other approximation parameters in order to equilibrate the errors. In particular, the number of random variables $d$ and the number of samples $N$ are chosen such that the KL truncation error in (20) and the quadrature error of the TT-qIW method, respectively, are equal to the discretization error (see Fig. 10, left).
Figure 9: Inverse diffusion problem: relative errors in the mean flux (left) and in the exceedance probability (right) plotted against the total CPU times (sec.) for $\sigma_n^2 = 10^{-3}$.
The solid lines in Fig. 10 (right) give the computational times necessary for the various components of our algorithm (with all errors equilibrated), as a function of $d$ (and thus also as a function of $h^{-1}$ and $N$): the ALS-cross algorithm to build the TT surrogate of $u_h$, the TT cross algorithm to build the TT surrogate of $\pi$, the TT-CD sampling procedure for the $N$ samples $x^{(\ell)}$, $\ell = 1, \ldots, N$, and the evaluation of $\pi$ at the $N$ samples. Clearly, the $N$ PDE solves in the evaluation of $\pi$ are the dominant part, and the complexity of these evaluations grows fairly rapidly due to the spatial mesh refinement and the increase in $N$. The costs of the TT cross algorithm for building $\tilde\pi$ (once a TT surrogate of the forward solution is available) and of the TT-CD sampler depend on the dimension $d$ and on the TT ranks of $\tilde\pi$ (which grow very mildly with $d$ and $h^{-1}$). In addition, we also ran all the experiments with $h = 2^{-6}$ and $N = 2^{14}$ fixed, varying only $d$, to see the growth with $d$ explicitly. The timings for these experiments are plotted using dashed lines. The cost of the ALS-cross algorithm to build $\tilde u_h$ grows cubically in $d$, while the cost to build the TT surrogate $\tilde\pi$ and the cost of the TT-CD sampling procedure grow linearly with $d$, making the TT-CD sampler an extremely effective surrogate for high dimensions. Since the evaluation of $\pi$ is dominated by the cost of the PDE solve, its cost does not grow with dimension.

Conclusion
We presented a method for computational inference based on function approximation of the target PDF. That task has traditionally been viewed as infeasible for general multivariate distributions due to the exponential growth in cost for grid-based representations. The advent of the tensor train representation, amongst other hierarchical representations, is a significant development that circumvents that 'curse of dimensionality'. Our main contributions here have been showing that the conditional distribution method can be implemented efficiently for PDFs represented in (interpolated) TT format, and that quasi-Monte Carlo quadrature is both feasible and efficient with bias correction through a control-variate structure or via importance weighting. The latter scheme was most efficient across all computed examples and parameter choices.
We adapted existing tools for tensors, i.e., multi-dimensional arrays, in particular the TT cross approximation scheme, and tools for basic linear algebra. We expect that substantial improvement could be achieved with algorithms tailored to the specific tasks required, such as function approximation, and the setting of coordinates and bounding region. Nevertheless, the algorithms presented are already very promising, providing sample-based inference that is more computationally efficient than conventional adaptive MCMC in all our computed examples. Extensive computations showed that in each example the methods performed as theory predicts, and that scaling with dimension is linear. We view the methods developed here as a promising development in Markov chain Monte Carlo methods. It is noteworthy, however, that our most efficient algorithm neither implements a Markov chain for the basic sampler, nor uses standard Monte Carlo quadrature. Instead, points from a quasi-Monte Carlo lattice are mapped into state space by the inverse Rosenblatt transform, implemented in the TT-CD algorithm, with unbiased estimates available through importance-weighted quasi-Monte Carlo quadrature (the TT-qIW algorithm). Nevertheless, the basic structure remains a proposal mechanism that is modified to produce a sequence of points that is ergodic for the target distribution.
Numerical experiments were carried out in Matlab R2016b on an Intel Xeon E5-2650 CPU at the Balena High Performance Computing Service at the University of Bath, using one core per run. We implemented Algorithm 2 in Matlab and in C+Python, using the TT-Toolbox in Matlab [39] and the Python TT-Toolbox, respectively. The code is available at http://github.com/dolgov/tt-irt; we welcome suggestions or feedback from users.