Differentiated uniformization: a new method for inferring Markov chains on combinatorial state spaces including stochastic epidemic models

We consider continuous-time Markov chains that describe the stochastic evolution of a dynamical system by a transition-rate matrix Q which depends on a parameter 𝜃 . Computing the probability distribution over states at time t requires the matrix exponential exp ( tQ ) , and inferring 𝜃 from data requires its derivative 𝜕 exp ( tQ ) ∕ 𝜕𝜃 . Both are challenging to compute when the state space and hence the size of Q is huge. This can happen when the state space consists of all combinations of the values of several interacting discrete variables. Often it is even impossible to store Q . However, when Q can be written as a sum of tensor products, computing exp ( tQ ) becomes feasible by the uniformization method, which does not require explicit storage of Q . Here we provide an analogous algorithm for computing 𝜕 exp ( tQ ) ∕ 𝜕𝜃 , the differentiated uniformization method . We demonstrate our algorithm for the stochastic SIR model of epidemic spread, for which we show that Q can be written as a sum of tensor products. We estimate monthly infection and recovery rates during the first wave of the COVID-19 pandemic in Austria and quantify their uncertainty in a full Bayesian analysis. Implementation and data are available at https:// github. com/ spang-lab/ TenSIR.


Introduction
Predicting the time evolution of complex dynamical systems has a wide range of applications in medicine and public health.One of them is the SIR model of epidemic spread, which describes a population by the numbers of susceptible (S), infected (I) and recovered (R) people.Until recently the SIR model has been approximated deterministically (Kermack and McKendrick 1927) and was considered computationally intractable in its stochastic formulation (McKendrick 1925).The stochastic SIR model is a continuous-time Markov chain (CTMC) in which infections happen randomly with a rate proportional to S and proportional to I (Allen 2017).The state of the system at a given time is thus fully specified by the combination of S and I. Since infections happen randomly one must keep track of a huge number of probabilities, one for every possible state.For example, the Austrian population of 9 million people can go through roughly 9 million × 9 million = 81 trillion possible states during the course of an epidemic.
More generally, we consider CTMCs that describe the evolution of a probability distribution p(t) over a huge discrete state space according to the Kolmogorov forward equation dp(t) dt = Qp(t) with solution p(t) = exp(tQ)p(0). ( Here Q is the transition-rate matrix and exp(tQ) is the matrix exponential.For an SIR model of the Austrian population Q has 81 trillion × 81 trillion entries, and naively computing the matrix exponential requires on the order of 81 trillion × 81 trillion × 81 trillion operations (Moler and Van Loan 2003), which is practically impossible.Even more dauntingly, when Q depends on an unknown parameter θ, such as the infection or recovery rate in the SIR model, we must first infer θ from data by maximizing its likelihood or by sampling from its posterior in a full Bayesian analysis.This typically requires the derivative of the matrix exponential ∂ exp(tQ)p(0)/∂θ in order to compute ∂p(t)/∂θ.However, Ho, Crawford, et al. (2018) have recently provided an algorithm that solves the Kolmogorov equation in the Laplace domain and evaluates the inverse Laplace transform numerically, thus avoiding the matrix exponential.Their algorithm is applicable to systems where each discrete variable increases monotonically.This includes the SIR model,1 for which their algorithm scales quadratically in the population size.
Here, we provide an alternative algorithm that directly computes exp(tQ) and ∂ exp(tQ)/∂θ.For the SIR model it scales cubically in the population size but is still practical.Importantly, our approach is applicable to a broader class of CTMCs with large state spaces that arise from interacting discrete variables, without requiring monotonicity.For example, in tumor progression models the states are combinations of possible mutations (Beerenwinkel andSullivant 2009, Schill et al. 2019), in stochastic neural networks the states are activation patterns of neurons (Yamanaka et al. 1997), in predator-prey dynamics they are joint population sizes of interacting species (Owen et al. 2014), or in chemical reaction networks they are joint counts of chemical species (Wolf 2007).
For many of these models Q can be written as a sum of tensor products (P.Buchholz 1999).We provide such a representation for the stochastic SIR model.To the best of our knowledge, this representation is novel.We use it for matrix-vector products that do not require explicit storage of Q (Buis and Dyksen 1996) and make computation of the matrix exponential tractable via the uniformization method (Grassmann 1977).A similar approach by Sherlock (2021) exploits the sparsity of Q.We extend the uniformization method and provide an analogous algorithm that also computes the derivative of the matrix exponential.Finally, we use Hamiltonian Monte Carlo sampling to provide a full Bayesian analysis of the first wave of the COVID-19 pandemic for the Austrian population, shedding new light on the uncertainties associated with the estimation of infection and recovery rates.

Differentiated uniformization for parameter estimation
A discrete-state, continuous-time Markov chain (CTMC) describes probability distributions p(t) ∈ R |X| over a state space X, where an entry p(t) x denotes the probability that the CTMC is in state x ∈ X at time t ∈ [0, ∞).Its change over time is governed by the Kolmogorov forward equation with transition-rate matrix Q ∈ R |X|×|X| , where an off-diagonal entry Q y,x is the transition rate from state x ∈ X to state y ∈ X and diagonal entries are set such that columns sum to zero.The solution to eq. ( 2) is given by the matrix exponential which could be approximated in principle by terminating after a finite number of terms.However, catastrophic cancellations occur (Moler and Van Loan 2003) due to the fact that Q has negative entries and negative eigenvalues. 2 The uniformization method (Grassmann 1977) addresses this problem by introducing a strictly nonnegative matrix such that does not suffer from cancellations.P can be viewed as the transition probability matrix of a discrete-time Markov chain where the number of transitions is a Poisson-distributed random variable with mean γt.
Using the recursions p(t) can be computed according to eq. ( 5) by algorithm 1 (Grassmann 1977).Note that P n p(0) sums to 1 and hence eq. ( 5) sums to less than 1 when terminated after a finite number of terms.The algorithm stops once this probability mass defect is smaller than a preset tolerance ε.The required number m of iterations is in O(γ) (Reibman and Trivedi 1988) and can be determined, e.g., using the numerically robust method by Sherlock (2021).
In this paper we are interested in statistical models where Q depends on a parameter θ that we want to estimate from data by maximizing its likelihood or by sampling from its posterior.To this end, we propose a novel algorithm for computing the derivative building on the uniformization method.We use the recursions ( 6), ( 7) and additionally to compute p(t) := ∂p(t)/∂θ according to eq. ( 9) by algorithm 2.
Algorithm 1: Uniformization input : p(0), t, P, γ, ε output: p(t) Algorithm 2: Differentiated Uniformization input : p(0), t, P, P , γ, γ , ε output: p(t), p(t) Applying differentiated uniformization for a particular statistical model requires the scalars where a generic choice for γ can be the 2-norm of the diagonal of Q or any p-norm with even p.It also requires the operators Crucially, these operators are only needed for matrix-vector products in lines 11 and 12 of algorithm 2 and do not need to be stored explicitly.This makes our method especially useful for models where Q is large but has a compact representation as a sum of tensor products, which allows one to cheaply compute matrix-vector products (Buis and Dyksen 1996).
Differentiated uniformization thus opens the door to parameter inference for CTMCs on huge discrete state spaces.Let {x 1 , . . ., x K } be observations of the Markov chain at corresponding time points {t 1 , . . ., t K }.We represent each data point by an empirical probability distribution δ(t k ) ∈ R |X| , where δ(t k ) x k = 1 and all other entries are zero.The likelihood of θ for a single observation of state x k at time t k with k > 1 is The log-likelihood for the whole data set, can be maximized using its derivative for example by gradient ascent.This derivative can also be used for sampling a posterior distribution of θ in a full Bayesian model using a Hamiltonian Monte Carlo method (Gelman et al. 2013).

Modeling epidemic spread
The most basic models of epidemic spread are SIR models, which describe the numbers of susceptible (S), infected/infectious (I) and recovered (R) people during an epidemic in a closed population of constant size N .There is a widely known deterministic and a lesser-known stochastic variant of the SIR model (Allen 2017).The latter involves a huge discrete state space and is therefore challenging to use.However, it allows for uncertainty quantification in its dynamics and in inferred parameters.The deterministic SIR model (Kermack and McKendrick 1927) assumes that S(t), I(t), R(t) ∈ [0, N ] are continuous and describes their evolution over time t ∈ [0, ∞) by the following system of nonlinear ordinary differential equations: where α, β ∈ R + are parameters.Note that once S(t) and I(t) are given, R(t) = N − S(t) − I(t) is already determined and can be omitted in further analysis.
In words, an infection occurs when a susceptible person comes in sufficiently close contact with an infected person, which happens proportionally to the number of susceptible and to the density of infected people in the population and proportionally to an infection rate β.This rate β encompasses, for example, disease characteristics, people's behavior, public policy and weather.An infected person recovers with rate α and can then no longer become susceptible or infected again.The basic reproduction number R 0 := β/α is the number of people (in a fully susceptible population) that one infected person infects before recovering.
There is no analytical solution to system ( 16), but it can be solved numerically, for example by Euler's method: The black curve in Figure 1a illustrates this solution for given parameters α = 1w −1 , β = 2.5w −1 and initial conditions N = 500, I(0) = 3, S(0) = 497.This model has several limitations.First, an epidemic is in fact a stochastic process and not a deterministic dynamical system.Second, without modeling the stochastics explicitly it is not possible to quantify the uncertainties of inferred parameters, which contributes to the uncertainties in the course of the epidemic.
The stochastic SIR model (McKendrick 1925;Allen 2017) is a continuous-time Markov chain over all possible states of the population.A state is a pair of integers (S, I) ∈ {0, . . ., N } × {0, . . ., N } denoting the number of susceptible and infected people.States with S + I > N are unreachable but still accounted for in the model. 3et p(t) ∈ R (N +1) 2 denote the probability distribution at time t over all states (S, I).That is, p(t) (S,I) is the probability that at time t there are S susceptible and I infected people.Its time evolution is governed by the Kolmogorov forward equation where the matrix Q ∈ R (N +1) 2 ×(N +1) 2 contains the transition rates from a state (S, I) to a state (S + ∆S, I + ∆I): The blue and red curves in Figure 1a depict 10 randomly sampled trajectories where transitions happen according to the rates in eq. ( 19), generated by the Gillespie (1976) algorithm.The trajectory highlighted in red shows how stochastic fluctuations, especially at the beginning of the epidemic, can drastically alter the shape of the curve compared to its deterministic counterpart.Figure 1b shows the analytic solution to eq. ( 18) and further illustrates that the stochasticity is not merely additive noise around the deterministic solution.In particular, the stochastic SIR model allows for a bifurcation where the epidemic dies out in the beginning with nonzero probability.
The parameters α, β ∈ R + can be inferred from data using differentiated uniformization.This requires multiple matrix-vector products with Q which is, however, too large to be stored explicitly, even for populations of only thousands of people.Hence, we propose a novel representation of Q that does not require explicit storage.To this end, we introduce band matrices of size (N + 1) × (N + 1): This yields a representation of the transition-rate matrix as a sum of tensor products4 (see Figure 2 for an illustrated explanation).Note that eq. ( 21) is not an approximation but an exact reformulation of eq. ( 19).The benefit of this representation is that its storage complexity is O(N ) rather than O(N 4 ) and that performing matrix-vector products has a complexity in only O(N 2 ) (Buis and Dyksen 1996) rather than O(N 4 ).Additionally, differentiated uniformization requires the derivative ∂Q/∂θ.Here we perform inference with respect to logarithmic parameters θ = (log α, log β) in order to ensure the positivity constraint on α and β: Finally, differentiated uniformization requires a differentiable upper bound γ on the absolute diagonal entries of Q.For the SIR model we choose the exact maximum γ = max It is differentiable5 for α = (N − 1)β with Overall, differentiated uniformization performs O(γ) matrix-vector products and thus has a total runtime complexity in O(γN 2 ) = O(N 3 ) for the SIR model.It requires storage of the result p(t), which has complexity O(N 2 ).
For parameter inference we are typically only interested in the likelihood that an earlier data point (S, I) is followed by a later data point (S + ∆S, I + ∆I) after time t.Since the number of susceptibles cannot increase (∆S ≤ 0) and the number of recovered cannot decrease (∆R = −∆S − ∆I ≥ 0) along a trajectory, it is sufficient to compute p(t) and p(t) on the restricted state space {S + ∆S, . . ., S} × {I − ∆R, . . ., I − ∆S}, as explained in Appendix A. Following Ho, Crawford, et al. (2018) we use this state-space restriction to reduce the time complexity of our algorithm to O (I + |∆S|)(∆S 2 + |∆S|∆R) and its storage complexity to O(∆S 2 + |∆S|∆R).Here we model the first wave of the COVID-19 pandemic in Austria as a stochastic SIR model.We employ differentiated uniformization to estimate the parameters α and β and quantify their uncertainty.We use daily numbers on S, I and R between 2020-03-01 and 2020-09-01 from public health data provided by the Austrian Bundesministerium für Soziales (2021) (Figure 3).I and R are given directly, and we set S = N − I − R assuming that the initial population size N = 8,932,664 stays constant.People who have died from COVID-19 are counted under "recovered" in a technical sense as they are no longer infectious.We do not correct for undiscovered cases and biases in testing and reporting.We also assume that parameters are piecewise constant for each month.We do a full Bayesian analysis for parameter pairs (log α, log β) with a uniform prior.Following Ho, Crawford, et al. (2018) we sample from the joint posterior using a Hamiltonian Monte Carlo (HMC) scheme (Duane et al. 1987;Neal 2011).Unlike a standard Metropolis-Hastings scheme, HMC makes use of the gradient of the likelihood, which we compute using differentiated uniformization.This makes sampling more efficient with less samples needed to cover the posterior distribution (Gelman et al. 2013).We estimated the joint posterior of (log α, log β) for every month between March 2020 and August 2020 separately.For each month we performed 10 parallel Monte Carlo chains with length 100, where we discarded the first 10 points each, resulting in 900 points per month.These calculations were done on the QPACE 4 cluster (Georg, Meyer, et al. 2021) and took about 30 minutes for each of May and June, 2 hours for July, 16 hours for each of April and August and 4 days for March.
Figure 4 shows the results of this analysis.The estimated posterior is plotted for (α, β) on logarithmic scales.The gray shaded areas were generated using Gaussian-kernel density estimation applied to the posterior samples.The crosses mark the least-squares estimators of the corresponding deterministic SIR models.The dashed lines represent parameter constellations where α = β and thus R 0 = 1.Here the epidemic switches between growing and decreasing numbers of infected.From April-August 2020 the posterior of the recovery rate α varies around a value of 0.07 per day, corresponding to the realistic mean time to recovery of about 2 weeks (Faes et al. 2020).In contrast, the posterior of α in March 2020 appears to be off, with a mean of about 0.03 per day corresponding to a mean time to recovery of one month.Inspecting the original numbers, we observed that the numbers of recovered are unexpectedly low (less than 100 people until 2020-03-23) possibly due to lagging declaration of recoveries because of cautious hospital policies in the beginning of the pandemic.
Overall, we observe large uncertainties associated with the parameters in several but not all months.These might hint at epidemic courses that are not perfectly in line with an SIR model or with the assumption of piecewise constant parameters.Such deviations from the model assumptions are much less readily apparent in deterministic approaches.

Discussion
We provide a novel method for computing the transient distribution and its derivative for continuous-time Markov Chains on huge discrete state spaces.This makes parameter inference tractable for a large family of statistical models, including the stochastic SIR model of epidemic spread.
Our key observation is that the transition-rate matrix of an SIR model can be written as a sum of tensor products, which allows us to cheaply compute matrix-vector products without storing the matrix itself.This operation alone is sufficient to compute the transient distribution by the uniformization method (Grassmann 1977), a numerically stable power-series expansion of the matrix exponential.We propose the differentiated uniformization method, an analogous power series for computing the derivative of the transient distribution with respect to parameters of a CTMC.
For the SIR model our algorithm scales cubically in the size of the population, which is one order slower than the state-of-the-art method for multivariate birth processes (Ho, Crawford, et al. 2018).On the other hand, our generalpurpose algorithm also applies to birth-death processes such as predator-prey dynamics (Owen et al. 2014), which have been considered intractable so far (Ho, Xu, et al. 2017).We illustrate this in Appendix B.
Beyond epidemiology we are interested in tumor progression modeling using mutual hazard networks (Schill et al. 2019).Similar to an epidemiological model that scales exponentially in the number of compartments, a tumor progression model is a CTMC that scales exponentially in the number of possible mutations.While both differentiated uniformization and the algorithm of Ho, Crawford, et al. (2018) have the potential to advance this field, large scale inference remains an open problem for tumor progression models with up to hundreds of mutations.The tensor representation of the transition-rate matrix could serve as a starting point for representing the transient distribution itself in a low-rank tensor format.These formats reduce the exponential cost (e.g., in the number of mutations or compartments) to linear cost provided certain low-rank structures exist (Hackbusch 2012).For large-scale CTMCs, low-rank tensor formats were already successfully used, e.g., for the computation of transient (Johnson et al. 2010) and stationary distributions (Benson et al. 2017;Peter Buchholz et al. 2016;Kressner and Macedo 2014) and also for a variant of the uniformization method (Georg, Grasedyck, et al. 2020).Therefore, the combination of low-rank tensor formats and differentiated uniformization could be a promising new avenue for large-scale inference problems in computational oncology and epidemiology.
Note that the columns of Q sum to less than zero and that p(t) therefore sums to less than 1 on the restricted state space.Computing matrix-vector products using these operators has a time complexity in O(|∆S| 2 + |∆S|∆R).
The largest absolute diagonal entry of Q is We perform m iterations of algorithm 2 such that the entire probability mass (including that which left the restricted state space) according to eq. ( 8) reaches the required tolerance.Hence, the overall time complexity of the algorithm is Storing the result p(t) has complexity O(|∆S| 2 + |∆S|∆R).
B Predator-prey dynamics  We consider the following deterministic predator-prey equations, based on (Lotka 1925;Volterra 1926) (37) which describe how the population size X(t) ∈ R + of a prey species and the population size Y (t) ∈ R + of a predator species change continuously over time as the species interact (black curve in Figure 5a).The parameter α is the birth rate of prey, δ is the death rate of predators and β is the contact rate between predators and prey.Upon contact a prey is consumed and, we assume for simplicity, exactly one predator is born.X max is a finite carry capacity representing the available plant resources for the prey species, which would result in logistic growth in the absence of predators.This is neither necessary nor commonly assumed in the literature on the deterministic model, since the prey population is always limited by a nonzero number of predators in R + .
(a) Solution of a deterministic SIR model (black curve) and 10 randomly sampled trajectories of the corresponding stochastic SIR model (blue and red).The trajectory highlighted in red deviates drastically from the deterministic solution.(b) Analytic solution of the Kolmogorov equation for a stochastic SIR model.The time slices show distributions p(t) where the shades of blue show the smallest (not necessarily contiguous) areas that contain 30%, 60% and 90% of the probability mass at time t.

Figure 2 :Figure 3 :Figure 4 :
Figure 2: Illustration of Q for a population of size N = 3 given by its entry-wise representation in eq.(19) (top) and its tensor representation in eq.(21) (bottom).Blue numbers indicate susceptibles, red numbers indicate infected and blank entries in the matrices are zero.Transitions should be read from columns to rows.S + inf : An infection decreases the number of susceptibles by one and happens proportionally to the current number of susceptibles.I + inf : At the same time, an infection increases the number of infected by one and happens proportionally to the current number of infected.The tensor product ⊗ combines both these transitions for a single infection.Moreover, an infection happens inversely proportional to the total population size N and proportionally to the parameter β. S + rec : A recovery does not change the number of susceptibles.I + rec : At the same time, a recovery decreases the number of infected by one and happens proportionally to the current number of infected.The tensor product ⊗ combines both these transitions for a single recovery.Moreover, a recovery happens proportionally to the parameter α.The matrices S − inf , I − inf , S − rec , I − rec generate corresponding negative entries for the diagonal of Q.
(a) Solution of a deterministic predator-prey model (black curve) and 10 randomly sampled trajectories (blue) of the corresponding stochastic model.(b) Analytic solution of the Kolmogorov equation for a stochastic predator-prey model.The time slices show distributions p(t) where the shades of blue show the smallest areas that contain 30%, 60% and 90% of the probability mass at time t.