Low-rank tensor methods for Markov chains with applications to tumor progression models

Cancer progression can be described by continuous-time Markov chains whose state space grows exponentially in the number of somatic mutations. The age of a tumor at diagnosis is typically unknown. Therefore, the quantity of interest is the time-marginal distribution over all possible genotypes of tumors, defined as the transient distribution integrated over an exponentially distributed observation time. It can be obtained as the solution of a large linear system. However, the sheer size of this system renders classical solvers infeasible. We consider Markov chains whose transition rates are separable functions, allowing for an efficient low-rank tensor representation of the linear system’s operator. Thus we can reduce the computational complexity from exponential to linear. We derive a convergent iterative method using low-rank formats whose result satisfies the normalization constraint of a distribution. We also perform numerical experiments illustrating that the marginal distribution is well approximated with low rank.


Introduction
The dynamics of cancer can be studied using tumor progression models, cf. (Beerenwinkel et al. 2014). These models describe the evolving genotype of a tumor as a continuous-time Markov chain. A model includes d genomic loci that may or may not be mutated. It starts out with all mutations absent and then progressively accumulates mutations (or other genomic events such as copy number alterations). The number of possible states of the tumor is thus 2 d . In typical applications one is interested in probability distributions over this state space that are far from stationary. Extrapolating the future course of a given tumor requires the computation of transient distributions. However, since the age of a tumor and thus the time point of an observation of the Markov chain is generally unknown, we study the transient distribution integrated over an exponentially distributed observation time, called the time-marginal distribution. It can be obtained as the solution of a large linear system. Today at least d = 299 genes are known to drive tumor progression, cf. (Bailey et al. 2018), i.e., a single distribution in R 2 299 would require the storage of more entries than there are atoms in the observable universe. This phenomenon is called state-space explosion (Buchholz and Dayar 2007) and renders classical methods for calculating or storing distributions infeasible. This problem is also known in other application areas, e.g., chemical-reaction networks (Anderson et al. 2010), the chemical master equation (Kazeev et al. 2014), Hamiltonian dynamics (Haegeman et al. 2016), queuing networks (Chan 1987), evolutionary dynamics (Niederbrucker and Gansterer 2011) or stochastic neural networks (Yamanaka et al. 1997).
Our main goal is to develop a method for calculating or approximating the marginal distribution. Due to the state-space explosion, methods for calculating the entire marginal distribution in the context of tumor progression have been limited to about d = 25 genomic events (Schill et al. 2019). In (Gotovos et al. 2021) it is demonstrated that learning models of this size from data is an underspecified problem and that increasing the number d can make inference more robust. This requires a tractable approximation of the marginal distribution. To allow for a probabilistic interpretation of this approximation, the latter should be a probability distribution itself, i.e., all entries should be non-negative, and the sum of all entries should be equal to one. Here, we neglect the non-negativity to ensure an error-controlled approach, see, e.g., (Kim et al. 2013) for an overview. So the following questions need to be answered: 1. How can we overcome the state-space explosion? 2. Subsequently, how can we determine marginal distributions? 2. At the same time, how can we ensure that a solution sums to one?
In order to address question 1. we use so-called low-rank tensor formats. These are known to reduce the cost from exponential to linear in the number of events d provided the distribution exhibits a low-rank structure.
To do so, we model Markov chains of interacting processes as Stochastic Automata Networks (Plateau and Stewart 2000). These allow for a representation of the infinitesimal generator as a sum of Kronecker products. In the context of low-rank tensors, such a representation is referred to as CANDECOMP/PARAFAC (CP) format (Carroll and Chang 1970;Harshman 1970), and the number of elements in this sum is called CP rank. As there are dependencies between individual processes, the transition rates, i.e., entries of the generator, depend on the current state of the Markov chain. We model the transition rates in a separable way where each factor is defined by the current state in one automaton.
Based on the CP representation of the infinitesimal generator, such representations can also be derived for the operator as well as for the right-hand side of the linear system defining the marginal distribution. We use these low-rank tensor representations to overcome the state-space explosion, cf. question 1., and to compute the marginal distribution, cf. question 2.
Strategies which have been used so far to calculate the marginal distribution cannot be generalized to low-rank tensors. Existing methods for solving general systems in low-rank tensor formats can be roughly divided into optimization-based approaches and iterative procedures, see, e.g., (Grasedyck et al. 2013) for an overview. Here we focus on iterative ones and discuss related works in Sect. 3.2. In low-rank formats, the effort for storage and execution of arithmetic operations depends, as the name already suggests, on the rank. Arithmetic operations needed in iterative procedures lead to an increase in the representation rank. One way to counteract this rank increase is the so-called truncation, i.e., the approximation of a tensor by one of lower rank. This allows for reducing storage and computational costs and thereby for efficient iterative procedures in low-rank formats. Beside the CP format there are several other lowrank tensor formats known. Here we focus in particular on the hierarchical Tucker format (Hackbusch and Kühn 2009;Grasedyck 2010). A special feature of these compared to the CP format is the possibility of accurate truncation. In (Hackbusch et al. 2008) it is proven that a convergent iterative method which is supplemented by truncation still converges if the approximation error is small enough.
While guaranteeing convergence, iterative methods supplemented by truncation do not ensure that their results sum to one anymore. In order to address question 2., we derive a novel iterative method based on the Neumann series (Dubois et al. 1979) and the uniformization method (Grassmann 1977) using low-rank tensor formats. We verify that its result sums up to one and prove its convergence. In our numerical experiments, we focus on the concept of Mutual Hazard Networks (Schill et al. 2019) for tumor progression. Our experiments illustrate that the marginal distribution can be approximated by low-rank tensors using our new algorithm.
This work is organized as follows. In Sect. 2 we derive the linear system that defines the time-marginal distribution of a continuous-time Markov chain. Starting from the concept of Mutual Hazard Networks and Stochastic Automata Networks explained there, we define a model class of Markov chains describing interacting processes. For these we will then derive low-rank tensor representations for the operator and the right-hand side of the linear system. In Sect. 3 we introduce the concept of tensors. We review the hierarchical Tucker format and discuss related work. Using these formats we derive an iterative method and prove its convergence. In Sect. 4 we perform numerical experiments based on the concept of Mutual Hazard Networks. In Sect. 5 we conclude.

Time-marginal distribution
A continuous-time Markov chain is defined by its state space S, its infinitesimal generator Q and an initial distribution q. In this paper we assume that the state space S is discrete. The generator is an operator Q ∈ R S×S , i.e., a linear mapping from R S to R S , which stores the rates of transition from state x to another state y in Q y,x ≥ 0 and the rates of staying in a state x in Q x,x ≤ 0. By construction each column of Q sums up to 0. 1 The probability distribution p(t) as function of the time t ≥ 0 is defined as the solution of the initial value problem i.e., p(t) = exp (Qt) q for all t ≥ 0. We make the common assumption that every trajectory starts at the same state, i.e., the initial distribution p(0) = q ∈ R S is a canonical unit vector.
Here, we assume that the observation time t is unavailable, such as, e.g., in tumor progression modeling, and thus must be treated as a random variable. Therefore, we are interested in a so-called time-marginal distribution p which is independent of the time t and which we will call marginal distribution for brevity. Each entry p x of the marginal distribution indicates the probability of observing a state x ∈ S at a random time. We follow the common assumption that the sampling time is an exponentially distributed random variable with rate 1, i.e., t ∼ Exp[1]. Similar approaches can be found, e.g., in Hjelm et al. (2006); Beerenwinkel and Sullivant (2009);Schill et al. (2019); Gotovos et al. (2021). Please note that for generalt ∼ Exp[λ] with λ > 0 the same results can be obtained by simply rescaling the time, i.e., t :=t λ ∼ Exp[1]. Here, we have The following lemma shows that this integral exists.
Lemma 1 Let Q ∈ R S×S be the infinitesimal generator of a continuous-time Markov chain over the discrete state space S. Then the integral (2) exists and Proof For the existence of the improper integral we show that the spectrum σ (Q − Id) of the operator Q − Id fulfills σ (Q − Id) ⊆ C − := {z ∈ C | Re(z) < 0}. Applying the Gershgorin circle theorem (Gershgorin 1931) Then the statement follows from direct calculation. This result can also be obtained by understanding the process as an absorbing Markov chain where from each state a transition with rate 1 enters the absorbing state. Then the time-marginal distribution equals the distribution just before absorption, see also (Gotovos et al. 2021, Proposition 2). From this point of view, the statement follows from the theory of absorbing Markov chains.
Hence, the marginal distribution p is defined as the unique solution of a linear system, since the operator Id − Q is regular.
Next, we specify the class of Markov chains for which we compute the timemarginal distribution p. We start with an introduction into a specific type of model used in tumor progression and then generalize this type based on the concept of Stochastic Automata Networks (Plateau and Stewart 2000). All these Markov chains will offer a sparse representation of the infinitesimal generator Q in order to overcome the state-space explosion.

Modeling tumor progression via Mutual Hazard Networks
As mentioned in the introduction, tumor progression can be modeled as continuoustime Markov chain over a discrete state space S. Typically, one is interested in a transient distribution, but the time point of observation, i.e., the age of the tumor, is unknown. Nevertheless, in order to be able to make statements about the probability distribution for all possible tumors, the time-marginal distribution is needed.
For tumor progression, the state space S can be modeled as follows. By consideration of d genomic events, as point mutations, copy number alterations, or changes in DNA methylation, each state x ∈ S represents the genotype of a tumor by indicating whether a genomic event has occurred or not. Modelling a state x = (x 1 , . . . , x d ) ∈ S as a vector of length d with x i = 1, if event i has occurred, and otherwise x i = 0, the set of all possible states (or tumors respectively) can be represented as Thus, the number of possible tumors increases exponentially in the number d of genomic events considered, also known as state-space explosion.
In tumor progression modeling, the transition rates are usually unknown, and therefore certain assumptions have to be made. Here we focus on the concept of Mutual Hazard Networks (Schill et al. 2019) which offers a sparse representation of the infinitesimal generator Q. In (Schill et al. 2019) the following assumptions define a Mutual Hazard Network: (i) All events are assumed to occur one after another. (ii) All events are irreversible, i.e., there are no transitions from state x ∈ S with x i = 1 to y ∈ S with y i = 0 for an event i. (iii) The occurrence of an event depends on the genotype of the tumor in a separable way. This means that there are mutual effects between events on their rate of occurrence which can be factorized, and each factor is described by a certain event. (iv) Genomic events that have not occurred yet have no effect on the transition rates of all others.
According to assumption (iii) a first-order Cox proportional hazard model (Cox 1972) is used to specify the transition rates, i.e., there are parameters Θ ∈ R d×d defining the mutual effects on transition rates from state x to y, where x and y only differ in one event i with x i = 0 < y i = 1. To be precise, the parameter Θ i, j := Θ(x i → y i , x j ) ≥ 0 is the (multiplicative) effect of state x j for event j on the transition from x i to y i for event i for events i = j. The parameter Θ i,i := Θ(x i → y i , x i ) ≥ 0 is then the baseline rate of transition from x i to y i for event i. Following (iv) all effects Θ(x i → y i , x j ) with x j = 0 and i = j are equal to 1, i.e., they are neutral multiplicative effects in (5). This modeling allows for describing different mutual effects between genomic events on their transition rates, see (iii). Each box represents a genomic event relevant to breast cancer and each line a direct effect between them. A dashed line shows a unilateral effect and a solid line identifies a reciprocal effect. In addition, we can distinguish effects based on their nature and magnitude. In the present case, an amplification on the p-arm of the 16th chromosome (event +16p) inhibits a deletion on the p-arm of the 8th chromosome (event -8p) with Θ −8 p,+16 p = 0.9. Such an inhibitory effect of event j on i is indicated in the network by a red connection (lines without marks) and corresponds to a parameter Θ i, j < 1. Similarly, event +16p promotes an amplification on the q-arm of the 20th chromosome (event +20q) with Θ +20q,+16 p = 1.6. A promoting effect of event j on i is indicated by blue connection (lines with circles) in the network and corresponds to a parameter Θ i, j > 1. Neutral effects are represented in Fig. 1 by the absence of edges, e.g., event +16p is not connected to event +1q which means that event +16p does not affect event +1q directly. At the level of parameters this corresponds to a multiplicative effect of Θ i, j = 1. As can already be seen in Fig. 1, in typical applications most of the direct effects between events are assumed to be neutral. Beyond these direct effects, events can also influence each other indirectly. In the example shown, event +16p is not associated with event +1p, so it has a neutral direct effect on event +1p. However, event +16p favors the occurrence of event +20q, which itself inhibits event +1p. Excluding all other effects, event +16p could therefore indirectly have an inhibitory effect on event +1p (despite Θ +1q,+16 p = 1). The diagonal entries of Q follow directly from (5) since each column sums up to 0, i.e., Q x,x = − y =x Q y,x . Together with the separation in (5) this allows for the following representation of the infinitesimal generator, cf. (Schill et al. 2019): Instead of |S| · |S| = 2 2d entries, one only has to store 4d 2 entries. In the following, we will observe that the structure in (6) of the operator Q will allow us to use tensor formats for the calculation of the time-marginal distribution. While this is a specific model for tumor progression on genomic data, the question arises whether a similar structure of Q can be used for more general models with a larger range of applications. In general, the progression of disease is not an irreversible process, and symptoms and traits can arise and then abate, such as inflammation (Zhao et al. 2021). They also do not necessarily have to be modeled in a binary way, but can have different levels of severity, such as fever (Johnston et al. 2019). The same formalism can be used to model the attrition and maintenance of technical systems of interacting components (Amoia et al. 1981). To do this, we introduce the concept of Stochastic Automata Networks (Plateau and Stewart 2000) which are known to offer a similar representation for the infinitesimal generator Q. We note that the Markov chains defined by a Mutual Hazard Network also belong to this class of Stochastic Automata Networks.

Stochastic automata networks
For a continuous-time Markov chain of interacting processes the discrete state space factorizes in a natural way into the state spaces of the individual processes. Each process is itself a Markov chain over its own state space S i and is called a stochastic automaton A i . The set {A 1 , . . . , A d } of d stochastic automata is called a Stochastic Automata Network, cf. (Plateau and Stewart 2000). The full state space S consists of all possible combinations of states in each automaton, i.e., for n i := |S i | and n := max i n i it is given by Each state  (4). We now consider transitions between states in the full state space S, which are given by one or more transitions between states within the individual state spaces S i . 2 It is known that the infinitesimal generator Q of a Stochastic Automata Network can be represented as a sum of Kronecker products, i.e., for d automata and m transitions it is given by cf. (Plateau and Stewart 2000). Again, instead of |S| · |S| ≤ n 2d entries, one only has to store (2m + d) dn 2 entries. A specific way to define the matrices Q ( j) i is given in (6) for a Mutual Hazard Network, where the number of possible transitions m is equal to the number of events d. Note that the reduction of 2m + d to d terms in the sum results from the separability of the transition rates in (5).
Based on the formalism of Stochastic Automata Networks we extend the concept of Mutual Hazard Networks. On the one hand, the separability and parameterization should be preserved. On the other hand, an extension of the models and an enlargement of the possible application areas should be made possible by relaxing the assumptions.

Generalized modeling
We focus on a continuous-time Markov chain represented as a Stochastic Automata Network with d automata {A 1 , . . . , A d }. As in Sect. 2.3 each automaton A i has a state space S i with |S i | = n i , and the state space of the network is given by S = × d i=1 S i . According to (i) and (iii) we make the following assumptions: (I) There is only one transition in one automaton at a time.
(II) Each transition depends on the current global state in a separable way, i.e., each transition rate can be factorized, and each factor is described by the current local state in one automaton.

For each automaton
Again, we model the mutual effects on local transitions using parameters Θ. Each parameter Θ(t i , x j ) ≥ 0 is the effect of state x j in automaton A j on the transition t i ∈ T i in automaton A i for i = j and the baseline rate for i = j. Thus, the transition rate from state x to y in S, where x and y differ only in automaton A i and The diagonal entries of the generator are again given by Q x,x = − y Q y,x , and thus we specify a data-sparse representation by For i = j, Q (t i ) j is a diagonal matrix containing the effects of all states in automaton A j on the transition t i in A i , i.e., its diagonal entries are given by the parameters contains only two non-zero entries: The entry corresponding to Similar to the representation of Q for the Mutual Hazard Networks in (6), the number of terms in the sum is given by the overall number of possible transitions in the Markov chain, i.e., d i=1 |T i |. Compared to the general representation (8) for Stochastic Automata Networks, the reduction of terms is possible due to the separability of transition rates (II).
In contrast to the model based on Mutual Hazard Networks, this generalized version allows for different and larger local state spaces (instead of binary S i = {0, 1} for all i) and does not require specific restrictions as irreversibility or specific neutral effects of states. In particular, the automata may be cyclic. While the data-sparse structure of the infinitesimal generator Q based on certain parameters Θ is preserved. In the numerical experiments, we will then refer to the case of Mutual Hazard Networks, which have already been used in practical simulations for tumor progression (Schill et al. 2019;Gotovos et al. 2021).

Structure of the linear system
As mentioned in Sect. 1, the solution of the linear system in Schill et al. (2019) based on classical methods was limited to about d < 25 automata.
In order to overcome the state-space explosion, we need to go beyond classical methods. We have already given a data-sparse representation of Q, also in the more general case of Sect. 2.4. Note that the identity Id = d i=1 Id S i and the initial distribution can be written as Kronecker products, where Id S i ∈ R S i ×S i is the identity operator on R S i and e z i ∈ R S i is the z i -th canonical unit vector. Hence, the operator and the right-hand side of (10) have a representation as a short sum of d Kronecker products, which allows for efficient storage. We still need a low-rank method to solve the linear system, which is the subject of the next section.

Low-rank method to compute the marginal distribution
We now compute the marginal distribution p in a data-sparse way. To avoid losing this sparsity when performing arithmetic operations we regard our operators and distributions as tensors by interpreting the Kronecker products as tensor products.

Low-rank tensor formats
We view tensors as multidimensional generalizations of vectors and matrices, i.e., of one-dimensional and two-dimensional tensors.
Definition 1 (tensor) Let d ∈ N and I = × d i=1 I i be a Cartesian product of discrete index sets I i . An object B ∈ R I is called a tensor of dimension d. Each direction i ∈ {1, . . . , d} is called a mode of B, and the cardinality of the i-th index set |I i | is called the i-mode size.
In our case, the index set I corresponds to the state space S = × d i=1 S i , our distributions p, p(0) ∈ R S are tensors of dimension d, and the automata A i correspond to the modes with sizes n i = |S i |.
In the language of tensors, the structure of Q in (9) is an example of the socalled CANDECOMP/PARAFAC (CP) format introduced in Carroll and Chang (1970); Harshman (1970).
Then r ∈ N 0 is called the CP representation rank, and the b ( j) i are called the CP factors of B. The minimal r ∈ N 0 such that B has such a CP representation (11) is called the CP rank of B.
The infinitesimal generator Q in (9) has CP representation rank d i=1 |T i | ≤ dn 2 . The identity Id as well as the right-hand side p(0) have CP rank 1. A core advantage of the CP format is the data sparsity in case of small representation rank r : The representation (11) We use the Onotation for storage and computational costs to describe that costs in O( f (ω)) grow asymptotically no faster than const · f (ω) for a constant const > 0 and a function f depending on certain value ω.
Since we want to compute the marginal distribution p, we have to solve a linear system whose operator and right-hand side have a CP representation each. For an operator with CP rank r > 1 it is unknown how to calculate its inverse analytically. Hence, we need a solver and arithmetic operations to compute the solution numerically. Performing arithmetic operations such as adding two tensors and applying an operator results in an increase in the representation rank. In the CP format, this increase in representation ranks can be traced easily: For example, if we add two tensors B 1 and B 2 with representation ranks r 1 and r 2 by appending the CP factors of B 2 to those of B 1 , the sum B = B 1 + B 2 already has representation rank r 1 + r 2 . Similarly, applying a CP operator with representation rank r 1 to a CP tensor with representation rank r 2 , by applying the operator factor by factor to the CP tensor, results in a CP tensor with representation rank r 1 · r 2 .
To counteract this increase in the representation rank, arithmetics in low-rank formats is supplemented by a so-called truncation, i.e., approximating a tensor with one of lower representation rank. As the set of tensors with CP rank at least r is not closed for d > 2, low-rank approximation within the CP format is an ill-posed problem, see de Silva and Lim (2008), and usually optimization-based methods are used for this purpose. However, we overcome this drawback by using tensor formats that allow for truncation based on the singular value decomposition of matrices.
To do so, a high-dimensional tensor is reshaped into a matrix by selecting modes defining its rows, while all others define its columns. The resulting matrix, called matricization, is defined, following (Grasedyck 2010), as follows.
The matricizations of a three-dimensional tensor corresponding to the single modes {1}, {2} and {3} are visualized in Fig. 2. For a matricization of a tensor, the classical singular value decomposition can be used for low-rank approximation. So-called tree tensor formats make use of this observation. A popular one is the tensor train format which was first introduced to the  numerical analysis community in Oseledets and Tyrtyshnikov (2009). It is also known in other areas as matrix product states (White 1992;Östlund and Rommer 1995) or as linear tensor network (van Loan 2008).
Here we focus on the hierarchical Tucker format which was first introduced in Hackbusch and Kühn (2009) and further analyzed in Grasedyck (2010). As the name suggests, the hierarchical Tucker format is based on a hierarchical subdivision of the modes {1, . . . , d}. This bisection of the modes is described by a binary dimension tree. The set of all modes is distributed from the root to the children until only single element subsets are left in the leaves. Formally, a dimension tree can be defined as follows: Definition 4 (dimension tree) A dimension tree T for dimension d ∈ N is a binary tree with nodes labeled by non-empty subsets of D := {1, . . . , d}. Its root is labeled with D and each node t (identified with its label) satisfies one and only one of the following conditions: (1.) t ∈ L(T ) is a leaf of T and labeled by a single element subset t = {i} ⊆ D.
(2.) t ∈ I(T ) := T \ L(T ) is an inner node of T and has exactly two children t 1 , t 2 which fulfill t = t 1 ∪ t 2 and t 1 ∩ t 2 = ∅.
An example for a dimension tree with dimension d = 8 is shown in Fig. 3. The rank of a representation in the hierarchical Tucker format depends on the bisection, i.e., on the dimension tree. For each node t there is a rank component r t  (Hackbusch 2012, (11.46c)) corresponding to the matrix rank of the matricization B (t) . Thus, the representation rank in the hierarchical Tucker format is a tuple depending on the tree T and will be denoted by r := (r t ) t∈T . Again, the minimal representation rank is called the hierarchical Tucker rank. Given a dimension tree T , truncation can be performed in a quasi-optimal way, cf. (Grasedyck 2010). There it is shown that the error in the 2 -norm caused by truncation of a tensor B ∈ R I to rank r = (r t ) t∈T is bounded by where σ t,m is the m-th singular value of the matricization B (t) , B best is a best rank-r approximation, and C < 2 is a small constant. This error bound allows for truncation with guaranteed accuracy. Thus, truncating after arithmetic operations allows us to perform iterative methods in an efficient way and preserves their convergence (Hackbusch et al. 2008). Inspired by (Hackbusch 2012, Chapter 13), Table 1 lists some of these operations, as well as their respective cost.
In the hierarchical Tucker format, the storage complexity grows only linearly in the dimension d. This allows for efficient storage and overcomes the state-space explosion provided we have low ranks. In our case the dimension of the distribution tensors is equal to the number d of automata, and the mode sizes are the numbers of states n i for each automaton A i .
It is possible to convert a CP representation of rank r easily to a hierarchical Tucker one, where all rank components are bounded by r independent of the dimension tree, cf. (Hackbusch 2012). In general, however, the rank in the hierarchical Tucker format depends on the selected bisection, i.e., on the dimension tree. Thus, the question arises how to choose the dimension tree in order to obtain a low rank. For further explanation on the choice of tree we refer the reader to more advanced papers (Grasedyck and Hackbusch 2011;Ballani and Grasedyck 2014). We will discuss this issue for the marginal distribution p in Sects. 4.2 and 4.3 using some numerical experiments.

Related work
For large Markov chains, the data-sparse structure of the operator has already been exploited for fast matrix-vector applications (Buchholz and Dayar 2007). However, this allowed only the operator, but not the distributions, to be stored in a data-sparse form. The distributions were stored entry by entry and thus suffer from the statespace explosion. In Schill et al. (2019) a strategy based on a splitting of the operator Id − Q = D + L in a diagonal matrix D and a strictly lower triangular matrix L was presented. Using L d = 0 and the Neumann series, the marginal distribution is given by Note that this equation requires inverting the diagonal operator D or solving the corresponding linear system. In contrast to the matrix case, inverting a diagonal operator in low-rank tensor formats is not straightforward, and it is unclear whether the solution itself has a low-rank structure. Solving each linear system involving D would result in solving (d + 1) systems with a comparable complexity as the original one. Thus, the strategy in (13) cannot easily be transferred to low-rank tensors to overcome the state-space explosion. Alternatively, Gotovos et al. (2021) recently proposed a strategy for learning the parameters of a Mutual Hazard Network without computing the marginal distribution for all states. Given a data set of observations, optimizing the log-likelihood requires the marginal probability of each state observed, which is represented as sum of probabilities for all possible transition sequences leading to this state. For an observation with m mutations present there are m! possible sequences. In order to deal with the sheer number of sequences, a stochastic approximation based on a Metropolis-Hastings algorithm is used. Instead of restricting the state space, here we want to exploit the given low-rank tensor structures of the model, cf. Sect. 2.5, to overcome the state-space explosion.
We briefly review existing strategies to compute distributions using low-rank tensor formats. The CP format has been used extensively to approximate, in particular, stationary distributions, e.g., in Kulkarni (2011), and to derive conditions for their existence, e.g., in Fourneau (2008). In Benson et al. (2017) stationary distributions for random walks are computed via an eigenvalue problem using the CP format. The tensor-train format was successfully used for the computation of, e.g., transient distributions (Johnson et al. 2010), mean times to failure (Robol and Masetti 2019), and stationary distributions (Bolten et al. 2018;Kressner and Macedo 2014). There mainly optimization-based approaches were presented. In Buchholz et al. (2016) the hierarchical Tucker format was applied to reduce the storage cost for distributions and the computational cost for performing basic operations. Continuing in Buchholz et al. (2017) adaptive truncation strategies for the computation of stationary distributions using iterative methods were presented. In Kressner and Macedo (2014) a power iteration based on a formulation of the stationary solution by an eigenvalue problem was derived, where after each application of the operator in the power iteration the current approximation was rescaled to ensure that it sums up to one. However, the time-marginal distribution we are interested in has not yet been studied in the context of low-rank tensors. For this reason, we present such a method in the following section.

Low-rank method
We now make use of low-rank tensor formats to approximate the marginal distribution p as the solution of (10). Since the exact solution p is a probability distribution, its entries sum up to one, i.e., it fulfills where 1 ∈ R S is the tensor of all ones. To allow for a probabilistic interpretation of an approximation of p, we need to treat (14) as an additional constraint. To this end, we now present an iterative method based on the Neumann series (Dubois et al. 1979) and the uniformization method (Grassmann 1977). The following lemma holds true for any discrete-state continuous-time Markov chain.
Lemma 2 Let Q ∈ R S×S be any infinitesimal generator over any discrete state space S and p(0) ∈ R S any initial distribution. For γ ≥ max x∈S |Q x,x | > 0 the marginal distribution can be represented as a Neumann series, and the series converges.
Proof Due to Lemma 1 the marginal distribution can be represented as In order to use the Neumann series for inverting Id − γ 1+γ Id + 1 1+γ Q =: Id − G γ , it is sufficient to show that the spectral radius ρ G γ < 1. Applying the Gershgorin circle theorem (Gershgorin 1931) Thus, the spectral radius is bounded by For x ∈ S and z ∈ Z x,γ we obtain using γ ≥ |Q x,x |: Since the spectral radius is smaller than 1, the Neumann series converges with Alternatively we can derive (15) using the uniformization method. Here, the idea is to describe a continuous-time Markov chain by a discrete-time Markov chain with a time increment that is exponentially distributed. With this interpretation, we write the time-dependent probability distribution in (1) as for a time t ≥ 0. Note that P γ is the transition probability matrix of a discrete-time Markov chain, since γ is a bound on the diagonal entries of Q. Marginalization of time similar to (2) and substitution leads to (15): where Γ denotes the gamma function. We now discuss approximation strategies for (15). A natural approximation would bep for k ∈ N which is an approximation from below (entry by entry), i.e.,p (k) x ≤ p x for all x ∈ S and all k ∈ N. Based on the properties of the Neumann series this sequence converges linearly to p, butp (k) satisfies the normalization condition (14) only approximately for k → ∞. As P γ is a transition probability matrix, its application to a probability distribution leads to a probability distribution again satisfying (14), and thus 1, for all k ∈ N. By scaling each element of the sequence, we obtain for k ∈ N, which is now an approximation to p that satisfies (14). We prove its linear convergence to p in the following theorem for any discrete-state continuous-time Markov chain.
Theorem 1 Let Q ∈ R S×S be any infinitesimal generator over any discrete state space S and p(0) ∈ R S any initial distribution. Let further p be the solution of the corresponding linear system (10), and p (k) be defined by (18) for all k ∈ N. Then p (k) converges linearly to p as k approaches infinity, i.e., Proof For α k := (1+γ ) k (1+γ ) k+1 −γ k+1 , α := 1 1+γ and any k ∈ N we have Furthermore we obtain Since γ 1+γ < 1, the linear convergence follows.
So far, we have not needed any additional assumptions on the Markov chain. However, we are particularly interested in high-dimensional problems and want to overcome the state-space explosion using low-rank tensor methods. Therefore, we now assume that both, the generator Q and the initial distribution p(0), have a lowrank tensor representation, as derived in Sect. 2. In order to employ low-rank tensor formats, we supplement the iterative method corresponding to (18) by truncation. As shown in Hackbusch et al. (2008), a convergent iterative method combined with truncation still converges if the truncation error is sufficiently small. However, the truncation could change the normalization of the iteration. Therefore, we replace the scaling with α k by dividing by 1, p (k) as shown in Algorithm 1. The procedure should be continued until the norm of the relative residual is smaller than a given tolerance tol > 0.
Algorithm 1 Low-rank uniformization(Q, p(0), γ , tol) p (k+1) 10: k = k + 1 11: end while 12: return p (k) /s (k) Note that when using the unnormalized approximationp (k) by replacing the normalization by s (k) with (1 + γ ), the norm of the relative residual turns out not to be a useful measure and one must also consider the distance of s (k) and (1 + γ ). By taking the relative residual of the normalized version, this is done implicitly.
Our sequence p (k) k is defined as a normalized version of p (k) k , which is an approximation to p from below (entry-by-entry). By estimating the Poisson distributed error ofp (k) , one can obtain for k ∈ N as unnormalized approximation to p from above (entry-by-entry) which also converges linearly to p. In numerical experiments (not further discussed here),p (k) and also its normalized version turned out to be unpromising, since significantly more iteration steps were required to achieve the same approximation quality measured by the relative residual.
In general, Algorithm 1 does not require assumptions (I) or (II) but Q and p(0) to have appropriate low-rank tensor representations and an upper bound γ ≥ max x |Q x,x |. In general, computing all diagonal entries of Q is in O(n d ). Nevertheless, here we focus on Markov chains satisfying (I) and (II) with an infinitesimal generator Q in (9) depending on parameters Θ. This allows for the following inexpensive upper bound: which can be computed in O d 2 n 2 T , 3 where T = max i |T i | is the maximum number of possible transitions in an automaton. In the case of strongly varying parameters Θ, (19) may be a significant overestimation for the diagonal entries of Q. The question of how to determine a tighter bound with polynomial effort using low-rank tensor formats will be dealt with in future work.

Numerical experiments
We illustrate our method for the computation of time-marginal distributions in numerical experiments based on the model of Mutual Hazard Networks with d automata (genomic events). In this case, the parameters can be summarized in a matrix Θ ∈ R d×d .

Construction of synthetic parameters
We consider d automata and parameters Θ ∈ R d×d >0 with a particular block-diagonal form. Each quadratic block of size b × b characterizes a subset of b automata which directly affect one another. We draw all within each block so that their logarithmic values are normally distributed with mean μ = 0 and standard deviation σ = 0.25. Unless stated otherwise, all other parameters are exactly 1 (logarithmic values of 0), which corresponds to a neutral direct effect between automata of different blocks. Based on these blocks we study three types of parameters: (B1) (Strict) block structure: There are direct effects, i.e., parameters Θ i, j = 1, only between automata in the same block. (B2) Neighbor block structure: In addition to the effects within each block in (B1), there are direct effects between randomly chosen automata in neighboring blocks. The parameters Θ i, j are drawn such that their logarithmic values log Θ i, j are normal distributed with mean 0 and σ = 0.125 and the automata (i, j) are uniformly randomized in neighboring blocks. For each pair of neighboring blocks 4 random effects. (B3) Neighbor block structure with additional random effects: Besides the effects in (B2), there are direct effects between randomly chosen automata of different blocks. Again the corresponding parameters Θ i, j are drawn such that their logarithmic values log Θ i, j are normal distributed with mean 0 and σ = 0.125.
The choice of (i, j) is also uniformly randomized. For each parameter matrix Θ ∈ R d×d >0 we add 8 random effects. The blocks in (B1) correspond to non-interacting subsystems of the process. In biological models, these often represent well-defined, distinct pathways of genes that regulate specific functions of the cell such as the cell cycle, apoptosis and growth (Sanchez-Vega et al. 2018). Note that even if subsystems do not interact, it is not possible to trivially express the solution as a Kronecker product of solutions per block, since correlations are induced by the observation time acting as a confounding variable, see also (Gotovos et al. 2021, Section 3.2). Finding such blocks or groupings, e.g., during parameter determination, can be important for understanding and treating tumors. However, in practice, the behavior of a biological system can rarely be separated into well-defined subsystems that do not interact, so (B1) is an oversimplification. Hence, we add interactions between different subsystems in (B2) and (B3). For parameters of type (B2) all automata or events already interact indirectly. For example, if event A 1 favors the occurrence of event A 2 , i.e., Θ 2,1 > 1, and at the same time event A 2 inhibits the occurrence of event A 3 , i.e., Θ 3,2 < 1, then event A 1 also indirectly inhibits event A 3 even if the direct effect of A 1 on A 3 is neutral. We reduce the standard deviation when sampling the effects between different blocks to simulate that the influences between different blocks are much smaller than those within.

Default settings for the algorithm
For the application of Algorithm 1 we always choose the bound γ as in (19), which for Mutual Hazard Networks can be reduced to Unless stated otherwise, we use a canonical balanced dimension tree for the application of the hierarchical Tucker format, i.e., the automata are assigned to the leaves following their ordering, see Fig. 4. We perform all experiments for 100 randomly generated sample parameters for each combination parameter type, number of blocks and number d of automata. We compute low-rank approximations of the marginal distribution p using Algorithm 1 with a maximum relative truncation error B − trunc(B) / B ≤ ε trunc = 10 −7 , see (12). The algorithm stops when the norm of the relative residual is smaller than a tolerance value of tol = 10 −4 . The mean values we compute are arithmetic means. In the following, we call the representation rank of this approximation tensor an approximation rank of p.

Study of singular values
One fundamental assumption for solving linear systems using low-rank tensor methods is that a solution can be well approximated by a tensor of low rank. According to the error bound in (12), the truncation error is determined by the singular values of To analyze this issue, we solve (10) using classical matrix methods of MAT-LAB (Natick 2019) and compute the singular values of the corresponding matricization using the htucker toolbox (Kressner and Tobler 2014). Figure 4 shows the decay of the singular values of each matricization corresponding to the canonical balanced tree for d = 8 automata and parameters of type (B1) with 2 blocks of size b = 4 each. For each node the semi-logarithmic plot displays the means of the singular values over 100 samples.
We observe that the singular values exhibit an exponential decay. The two matricizations closest to the root are transposes of each other, and therefore their singular values are identical. The smallest 5 of the 16 singular values are indistinguishable from zero and therefore cannot be displayed in the semi-logarithmic plot. For other standard deviation σ , we observed a similar drop in singular values (not shown here). The exponential decay of the singular values indicates that the marginal distribution p can be well approximated with low rank.

Study of the tree structure
In general, the rank in the hierarchical Tucker format depends on the structure of the tree. In the following, we study how the ordering of the automata in the leaves of the tree affects the low-rank approximability. We preserve the balanced binary structure of the tree because this is advantageous for parallelization, cf. (Grasedyck and Löbbert 2018).
We already studied the singular values of the matricizations corresponding to the canonical balanced tree, see Fig. 4. We now change only the arrangement of the automata in the leaves of the tree and compute the marginal distribution p again. The decay of the singular values for each matricization is shown in Fig. 5, where the semi- Comparing Figs. 4 and 5, we observe that the choice of the canonical dimension tree, i.e., the original ordering in the leaves, results in a significantly faster decay of the singular values close to the root. Figure 5 shows that the singular values closest to the root also have an exponential decrease, but at a much slower rate. Note that 2 blocks of size b = 4 imply that the automata {A 1 , A 2 , A 3 , A 4 } interact directly with one another. The same holds for the automata {A 5 , A 6 , A 7 , A 8 }. There are no direct interactions between automata of different blocks. Hence, the modified balanced tree separates strongly interacting automata, which explains the slower decay of the singular values in level 1 (level 0 being the root). In contrast, the canonical balanced tree separates weakly interacting automata, leading to a faster decay of the singular values. We will make similar observations when studying the approximation rank. Based on this observation, the question arises whether and how a suitable dimension tree can be determined. Ballani and Grasedyck (2014) developed a black box method to determine a dimension tree for general low-rank problems. In our concrete model problem, however, the parameters Θ already seem to give a-priori information about suitable structures. How exactly a dimension tree can be constructed based on the parameters is, however, not a trivial question and requires a more detailed investigation, which would exceed the scope of this paper. In the following, we will focus on the investigation of the presented method.

Comparison with a matrix-based method
We check our method in Algorithm 1 against a classical method based on matrix operations presented in Schill et al. (2019), see (15). To do so, we use Mutual Hazard Networks computed there and the corresponding infinitesimal generators Q. The parameters are optimized based on data for breast cancer, colorectal cancer, renal cell carcinoma and glioblastoma which represent mutual effects between d = 10, 11, 12 and 20 genomic events each. The specific parameters and their interpretation can be found in Schill et al. (2019) and its supplementary material. The network for breast cancer is also shown in Fig. 1. We use the settings in Sect. 4.1.2 for Algorithm 1. We compare the result p tensor of Algorithm 1 using tensor methods and p vector of (13) based on matrix operations with respect to the relative residual (Id − Q)p − p(0) / p(0) and the distance of both results. We measure the distance using the relative Euclidean distance p vector − p tensor / p vector and the Kullback Leibler (KL) divergence (Kullback 1997) from p tensor to p vector . The KL divergence from p tensor to p vector is defined as and measures how p tensor differs from p vector with respect to p vector . Since p tensor is only an approximation of the marginal distribution p, there might be some negative values which are absolutely small. To deal with this problem, we use the convention 0·log(0) = 0 and cut off (p vector ) x ≤ ε cut-off = 10 −8 to zero. Besides, we list the costs for storing p vector and p tensor , respectively, in values which have to be stored for each solution. Table 2 summarizes the values for four different Mutual Hazard Networks corresponding to different types of tumors. By construction, the residuals for p tensor of Algorithm 1 are slightly below the requested tolerance tol = 10 −4 . For smaller values of tol (not shown in Table 2) smaller relative residuals for p tensor can be reached by increasing the number of iteration steps. The results p vector of (13) solve the equation nearly exactly. The relative Euclidean distances as well as the KL divergences from p tensor to p vector have very small absolute values. This implies that both results describe nearly the same distribution. Calculating the KL divergences for smaller cut-off parameters ε cut-off ∈ {10 −9 , . . . , 10 −16 }, we obtain the same values for the first three networks as with ε cut-off = 10 −8 . For the last (and largest) network, if ε cut-off ≤ 10 −10 , we observe some negative entries with very small absolute values, i.e., |(p tensor ) x | ≤ 10 −9 . This can be explained by the truncation used in Algorithm 1 to reduce the storage and computational complexity. Hence, we need ε cut-off > 10 −10 for this network. Calculating KL divergences requires the explicit evaluation of all entries of the tensors p tensor , since an efficient method for the entrywise application of the logarithm within low-rank tensor formats is unknown. This limits the computation of KL divergences to tensors of small dimension d. Comparing the storage costs for p vector and p tensor , we observe an overhead of storing the tensor solution compared to the vector one for diseases with d ≤ 12. However, already for d = 20 the storage cost for the vector solution explodes, whereas the cost for the tensor solution remains moderate with about 4% of the cost for p vector .
In Fig. 6, we further modeled the storage requirements for potential vector and tensor solutions as a function of d up to 50 in a semi-logarithmic plot. For this purpose, we made the simplified assumption of constant ranks r = (r , . . . , r ) for the hierarchical Tucker representation of p tensor with r ∈ {8, 16, 32}.
Again, we note an overhead of storing the tensor compared to the vector representation for d ≤ 20, but for d > 20 the tensor representation is preferable. Figure 6 shows the exponential increase of storage cost for the vector representation in the number of automata, i.e., the state-space explosion. In contrast the cost for the hierarchical Tucker representation grows only linearly in d as expected from Table 1. A low representation rank is essential in order to overcome the state space explosion, as it enters cubically into the storage cost. We will now investigate this in more detail.

Study of the approximation rank
The rank r = (r t ) t∈T in a hierarchical Tucker format is a tuple depending on the underlying tree T . For a simple comparison of tensor representations, we consider the effective rank r eff , which we define such that the storage cost for the representation equals the cost to store one with rank r = (r eff ) t∈T . In doing so, we round r eff up to the nearest integer.
We compute low-rank tensor approximations of the marginal distribution p using Algorithm 1 as described in Sect. 4.1. We plot the effective ranks as a function of the number d of automata for fixed maximum relative truncation error ε trunc = 10 −7 and fixed tolerance tol = 10 −4 in Fig. 7a, and as a function of ε trunc for fixed d in Fig. 7b. Both plots show mean values of effective ranks for 100 samples of parameters Θ each.  Fig. 7a and as a function of the maximum relative truncation error ε trunc in Fig. 7b using Algorithm 1 with tolerance tol for 100 sample parameters (In both plots the statistical errors, i.e., the variations between samples, are very small and therefore not shown.) Figure 7a displays effective ranks for parameters Θ of the three different types. The parameters of type (B1) have 8 and 4 blocks, respectively, and those types (B2) and (B3) have 4 blocks. The size of each block increases in the number d of automata, i.e., for 8 blocks the size of each block is b ∈ {1, . . . , 6} and for 4 blocks it is b ∈ {2, . . . , 12}. The automata belonging to a block are each grouped by the tree structure as in Fig. 4.
We observe that p is approximated to a tolerance of tol = 10 −4 with low rank in all cases. For separated blocks of type (B1), drawn in blue, the effective rank increases slightly in the number d of automata. Since the number of blocks is kept constant, also the size of each block increases in d. For this reason, especially for 4 blocks, there is a superposition of the effects of increasing d and increasing block size on the effective rank. This becomes particularly clear in the comparison between 8 and 4 blocks per parameter. In case of 8 blocks, the effective rank increases much more flatly and seems to be almost independent of the number d of automata constrained by r eff ≈ 8. The larger 4 blocks always combine the automata of two smaller blocks, i.e., they are twice as large. We thus hold that although the approximation ranks increase slightly with the number d of automata as well as the number b of directly interacting automata, they remain comparatively small, provided that the partitioning in the dimensional tree is preserved. As in Fig. 5, separating the automata of a block in the dimension tree can lead to an increase in the approximation rank (not shown here). A similar increase can be observed when further effects are added between neighboring blocks, i.e., (B2), or between randomly chosen blocks, i.e., (B3). In Fig. 7a the effective ranks for parameters with additional random effects, drawn in red, both increase in d until 24 and then remain almost constant at around r eff ≈ 20 for (B2) or slightly decreases respectively for (B3). This suggests that effects that do not fit the structure of the dimension tree can lead to an increase in the approximation rank. In addition, direct effects between neighboring blocks appear to have less impact on the approximation rank than those between more distant ones. For parameters of type (B3), we observe more significant differences between effective ranks for different samples. For some, the effective ranks are significantly above the mean, for some below it. This supports the conjecture that the location (i, j) of the direct effects Θ i, j = 1 affects the approximation rank. However, if only a few of these effects are present, the marginal distribution can still be approximated with comparatively low approximation rank. These results suggest that neither the number d of automata nor the size of the blocks alone are responsible for an increase in rank, but that especially the distribution of the automata in the dimension tree has a large effect. How to construct an appropriate tree in order to keep ranks low using a-priori information on the parameters is a topic of ongoing research.
The semi-logarithmic plot in Fig. 7b displays effective ranks for parameters Θ of type (B1) for d = 32 automata and 4 blocks of size b = 8 as a function of the maximum relative truncation error ε trunc for different tolerances tol = 10 −1 , . . . , 10 −4 . As the tolerance is lowered, the approximation of p corresponds to a particular stage during the iteration. We observe that all effective ranks are nearly constant in the maximum relative truncation error and the tolerance for tol. For tol = 10 −1 , the effective rank is slightly lower and for all other tolerance values it is about r eff ≈ 9. This observation suggests, on the one hand, that p can be accurately approximated with a tensor of effective rank r eff ≈ 9, since a more accurate truncation, i.e., smaller ε trunc , has only very small impact on the approximation ranks. On the other hand, the fact that the ranks are low and nearly independent of the tolerance value indicates that the ranks during the iteration are also low. This allows not only for efficient storage of the resulting approximation but also for efficient computation using Algorithm 1. For parameters of types (B2) and (B3), we observed that the increase in the approximation rank is further amplified for smaller maximum relative truncation errors ε trunc . Therefore, a suitable truncation accuracy seems to become more important for parameters with more direct effects between blocks. For Fig. 7a, the maximum relative truncation error ε trunc = 10 −7 is chosen to be comparatively small, but the approximation ranks remain relatively small for all numbers d of automata and types of parameters. During the iteration itself, the effective ranks remained almost constant (after a start phase) similar as shown in Fig. 7b.

Study of the speed of convergence
Having studied the low-rank structure of the distribution numerically, we now consider the convergence speed of our method. In Theorem 1 we proved that the iteration sequence converges linearly to the marginal distribution. If the truncation error is small enough, then the convergence of the method combined with truncation is preserved, cf. (Hackbusch et al. 2008). We will now see that this theoretical result also holds in practice.
To do this, we look at the decay of the relative residual as a function of the iteration steps. Again we use 100 parameter samples of type (B1) with 4 blocks for d = 8, 16 and 32 automata each. Figure 8a and b show semi-logarithmic plots of the norm of the relative residual as a function of the iteration step. Figure 8a displays the mean value (a) (b) Fig. 8 Norm of the relative residual as a function of the number of iteration steps using Algorithm 1 for 100 sample parameters of type (B1) with d automata and 4 blocks of the relative residual and Fig. 8b additionally the corresponding box plot illustrating the variances for d = 32 automata. We observe a linear convergence of the method for all values of d. For larger number d of automata the convergence slows down. A similar behavior can be observed for larger standard deviations σ of the logarithmic parameters log Θ i, j , where a larger σ means that also the parameters Θ i, j are more widely dispersed around 1. Both observations can be explained by the fact that γ (and our upper bound in (20)) increases in d and Θ i, j > 1, and thus the bound γ 1+γ on the convergence rate is closer to one, cf. Theorem 1. In Fig. 8b the ranges for d = 32 given by the boxes are small, which indicates that there are only a few outliers given by the whiskers. We have confirmed our results using different numbers d of automata and blocks (not shown). In our tests we observe that the number of iteration steps needed to achieve a certain tolerance grows linearly in the number d of automata but is nearly independent of the block size. This indicates that the number of iteration steps is independent of changes in the ordering of automata and consequently of changes in the rank.

Conclusion and future work
Inspired by current research in tumor progression models, we considered a class of continuous-time Markov chains that describe interacting processes, e.g., tumor progression models. Typically the age of a tumor at its diagnosis is unknown. For this reason, the transient distribution integrated over the exponentially distributed observation time is required. This so-called time-marginal distribution is uniquely defined as the solution a large linear system and suffers from the problem of state-space explosion. We modeled this class of Markov chains with separable transition rates factorizing according to the current state using Stochastic Automata Networks. This modeling enabled us to obtain a low-rank tensor representation of the operator and the right-hand side of this linear system. Based on these low-rank tensor representations, we derived an iterative method to compute a low-rank tensor approximation of the time-marginal distribution and hence were able to overcome the state-space explosion. The method guarantees that the entries of the approximation sum up to one as required for a probability distribution. We proved the convergence of the method. In numerical experiments focused on the concept of Mutual Hazard Networks we illustrated that the time-marginal distribution is well approximated with low rank. The method allows for consistently low ranks during the iteration, and linear convergence was observed independently of the number of processes/automata.
A probability distribution, in addition to being normalized to one, must be nonnegative. An approximation of a probability distribution should also satisfy this condition. In numerical experiments we observed that our method preserves this nonnegativity for small numbers of automata if the truncation is sufficiently accurate. How to guarantee non-negativity and at the same time convergence for high numbers of automata will be part of our future research. Moreover, we observed that the approximation rank for the time-marginal distribution depends strongly on the structure of the dimension tree and on the effects between automata. To minimize the approximation rank we plan to develop a strategy to determine an optimal dimension tree a-priori.