1 Introduction

Predicting the time evolution of complex dynamical systems has a wide range of applications in medicine and public health. One of them is the SIR model of epidemic spread, which describes how the numbers of susceptible (S), infected (I) and recovered (R) people in a population change over the course of an epidemic by a system of differential equations. In this simple model the total population size stays constant, i.e., there are no births or deaths, there is no migration, and people get infected and recover at most once. The population has no demographic structure and no geographic structure, i.e., all individuals meet each other randomly. More complex models extend the SIR model in order to account for these limitations (Tang et al. 2020).

Until recently, even the basic SIR model was approximated deterministically (Kermack and McKendrick 1927) and was considered computationally intractable in its stochastic formulation (McKendrick 1925). The stochastic SIR model is a continuous-time Markov chain (CTMC) in which infections happen randomly with a rate proportional to S and proportional to I (Allen 2017). The state of the system at a given time is the tuple \((S, I, R)\), which for a constant population size N is already specified by the tuple \((S, I)\) since \(R=N-S-I\). The state space is therefore \(\{ 0, \dots , N \} \times \{ 0, \dots , N \}\) with size \((N+1)\times (N+1)\). Since infections happen randomly, one must keep track of a huge number of probabilities, one for every possible state.

More generally, we consider a CTMC which describes probability distributions \(\textbf{p}(t)\in \mathbb {R}^{|X|}\) over a discrete state space X, where an entry \(\textbf{p}(t)_x\) denotes the probability that the CTMC is in state \(x \in X\) at time \(t\in [0, \infty )\). Its change over time is governed by the Kolmogorov forward equation

$$\begin{aligned} \frac{\textrm{d}\textbf{p}(t)}{\textrm{d}t} = Q \textbf{p}(t)\end{aligned}$$
(1)

with transition rate matrix \(Q \in \mathbb {R}^{|X| \times |X|}\), where an off-diagonal entry \(Q_{y,x}\) is the instantaneous transition rate from state \(x \in X\) to state \(y \in X\) and diagonal entries are set such that columns sum to zero. The solution to the Kolmogorov equation is given by the action of the matrix exponential,

$$\begin{aligned} \textbf{p}(t)= \exp \,\left( tQ\right) \, \textbf{p}(0)&= \sum _{n=0}^{\infty } \frac{t^n}{n!} Q^n \textbf{p}(0)\end{aligned}$$
(2)

whose complexity is quadratic in |X| when the sum is truncated after an appropriate number of terms. For example, an SIR model of the Austrian population with 9 million people has a state space X of size \(9\text { million}\times 9\text { million} = 81\text { trillion}\). The matrix Q has \(81\text { trillion}\times 81\text { trillion}\) entries and naively applying the matrix exponential on a vector is practically impossible, as well as numerically unstable.

Even more dauntingly, when Q depends on an unknown parameter \(\theta \), such as the infection or recovery rate in the SIR model, we must first infer \(\theta \) from data. This can be done, for example, by maximizing its likelihood, which is an optimization problem that can be solved more efficiently if the derivative

$$\begin{aligned} \frac{\partial \textbf{p}(t)}{\partial \theta } = \frac{\partial \exp \,\left( tQ\right) \,}{\partial \theta } \, \textbf{p}(0)\end{aligned}$$

is available. Alternatively, \(\theta \) can be inferred by sampling from its posterior in a full Bayesian analysis, which is also more efficient if the derivative of the likelihood is available.

However, Ho et al. (2018) have recently provided an algorithm that solves the Kolmogorov equation in the Laplace domain and evaluates the inverse Laplace transform numerically, thus avoiding the matrix exponential. Their algorithm is applicable to systems where each discrete variable increases monotonically. This includes the SIR model, for which their algorithm scales quadratically in the population size.

In this paper, we provide an algorithm that directly computes \(\exp \,\left( tQ\right) \,\) and, crucially, \(\partial \exp \,\left( tQ\right) \,/\partial \theta \) at the same time. For the SIR model it scales cubically in the population size but is still practical. Importantly, our approach is applicable to a broader class of CTMCs with large state spaces that arise from interacting discrete variables, without requiring monotonicity. For example, in tumor progression models the states are combinations of possible mutations (Beerenwinkel and Sullivant 2009; Schill et al. 2019), in stochastic neural networks the states are activation patterns of neurons (Yamanaka et al. 1997), in predator–prey dynamics they are joint population sizes of interacting species (Owen et al. 2014), and in chemical reaction networks they are joint counts of chemical species (Wolf 2007).

For many of these models Q can be written as a sum of tensor products (Buchholz 1999). We provide such a representation for the stochastic SIR model. To the best of our knowledge, this representation is novel. We use it for matrix–vector products that do not require explicit storage of Q (Buis and Dyksen 1996) and make computation of the matrix exponential tractable via the uniformization method (Grassmann 1977). A similar approach by Sherlock (2021) exploits the sparsity of Q. We extend the uniformization method and provide an analogous algorithm that also computes the derivative of the matrix exponential. Finally, we use Hamiltonian Monte Carlo sampling to provide a full Bayesian analysis of the first wave of the COVID-19 pandemic for the Austrian population, shedding new light on the uncertainties associated with the estimation of infection and recovery rates.

2 Differentiated uniformization for parameter estimation

The action of the matrix exponential

$$\begin{aligned} \textbf{p}(t)= \exp \,\left( tQ\right) \, \textbf{p}(0)&= \sum _{n=0}^{\infty } \frac{t^n}{n!} Q^n \textbf{p}(0)\end{aligned}$$
(3)

could in principle be approximated by truncating the series after a finite number of terms. However, catastrophic cancellation occurs (Moler and Van Loan 2003) because Q has negative entries and negative eigenvalues. The uniformization method (Grassmann 1977) addresses this problem by introducing an entrywise nonnegative matrix

$$\begin{aligned} P := \frac{1}{\gamma }Q+{{\,\textrm{I}\,}}\quad \text {for some } \gamma \ge \underset{x}{\max }|Q_{x,x}| \end{aligned}$$
(4)

such that

$$\begin{aligned} \textbf{p}(t)= \exp \,\left( tQ\right) \, \textbf{p}(0)&= \exp \,\left( \gamma t (-{{\,\textrm{I}\,}}+ P)\right) \, \textbf{p}(0)\nonumber \\&= \exp \,\left( - \gamma t{{\,\textrm{I}\,}}\right) \, \exp \,\left( \gamma tP\right) \, \textbf{p}(0)\nonumber \\&= \sum _{n=0}^{\infty } e^{-\gamma t} \frac{(\gamma t)^n}{n!} P^n \textbf{p}(0) \end{aligned}$$
(5)

does not suffer from cancellations. P can be viewed as the transition probability matrix of a discrete-time Markov chain where the number of transitions is a Poisson-distributed random variable with mean \(\gamma t\).

Using the recursions

$$\begin{aligned} P^n&= PP^{n-1}, \end{aligned}$$
(6)
$$\begin{aligned} \frac{(\gamma t)^n}{n!}&= \frac{\gamma t}{n} \frac{(\gamma t)^{n-1}}{(n-1)!}, \end{aligned}$$
(7)

\(\textbf{p}(t)\) can be computed according to Eq. (5) by algorithm 1 (Grassmann 1977). Note that \(P^n \textbf{p}(0)\) sums to 1 and hence Eq. (5) sums to less than 1 when terminated after a finite number of terms. The algorithm stops once this probability mass defect

$$\begin{aligned} 1 - \sum _{n=0}^{m} e^{-\gamma t}\frac{(\gamma t)^n}{n!} \end{aligned}$$
(8)

is smaller than a preset tolerance \(\varepsilon > 0\). The required number m of iterations is in \(\mathcal {O}(\gamma )\) (Reibman and Trivedi 1988) and can be determined, e.g., using the numerically robust method by Sherlock (2021).
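The recursion in Eqs. (5) to (8) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Algorithm 1: the function name `uniformization`, the dense nested-list storage of Q, and the guard against underflow of \(e^{-\gamma t}\) are assumptions made for the example. A practical implementation would replace the dense product by a matrix-free one.

```python
import math

def uniformization(Q, p0, t, eps=1e-10):
    """Approximate p(t) = exp(tQ) p0 by the truncated series of Eq. (5).

    Q[y][x] is the rate from state x to state y; columns sum to zero.
    """
    size = len(p0)
    gamma = max(abs(Q[x][x]) for x in range(size))  # smallest admissible gamma, Eq. (4)
    if gamma == 0.0:
        return list(p0)  # no transitions at all
    # P = Q / gamma + I has only nonnegative entries.
    P = [[Q[y][x] / gamma + (1.0 if x == y else 0.0) for x in range(size)]
         for y in range(size)]
    weight = math.exp(-gamma * t)  # Poisson weight for n = 0
    if weight == 0.0:
        raise ValueError("gamma * t is too large for this naive sketch")
    q = list(p0)                   # q holds P^n p0
    p = [weight * v for v in q]
    mass = weight                  # accumulated Poisson mass
    n = 0
    while 1.0 - mass > eps:        # probability mass defect, Eq. (8)
        n += 1
        q = [sum(P[y][x] * q[x] for x in range(size)) for y in range(size)]  # Eq. (6)
        weight *= gamma * t / n    # Eq. (7)
        p = [pi + weight * qi for pi, qi in zip(p, q)]
        mass += weight
    return p
```

For a two-state chain with rates a (state 0 to 1) and b (state 1 to 0), the output can be compared with the closed form \(\textbf{p}(t)_0 = \frac{b}{a+b} + \frac{a}{a+b} e^{-(a+b)t}\) when the chain starts in state 0.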

In this paper we are interested in statistical models where Q depends on a parameter \(\theta \) that we want to estimate from data by maximizing its likelihood or by sampling from its posterior. Both inference approaches benefit from utilizing gradient information. Al-Mohy and Higham (2009) proposed an efficient method to calculate derivatives of the matrix exponential for general matrices. Here, we propose a conceptually similar algorithm specifically tailored towards transition rate matrices based on the uniformization method:

$$\begin{aligned} \frac{\partial {\textbf{p}(t)}}{\partial {\theta }}&= \frac{\partial {\exp \,\left( tQ\right) \,}}{\partial {\theta }}\, \textbf{p}(0)\nonumber \\&= \frac{\partial {}}{\partial {\theta }} \left( \sum _{n=0}^{\infty } e^{-\gamma t}\frac{(\gamma t)^n}{n!} P^n \textbf{p}(0)\right) \nonumber \\&= \sum _{n=0}^{\infty } e^{-\gamma t}\frac{(t\gamma )^n}{n!} \frac{\partial {P^n}}{\partial {\theta }} \textbf{p}(0)+ e^{-\gamma t} \frac{\partial {\gamma }}{\partial {\theta }} \left( -\frac{t^{n+1}\gamma ^n}{n!} + \frac{t^n\gamma ^{n-1}}{(n-1)!}\right) P^n \textbf{p}(0)\nonumber \\&=\sum _{n=0}^{\infty } e^{-\gamma t}\frac{(t\gamma )^n}{n!}\left( \frac{\partial P^n}{\partial \theta } \textbf{p}(0)+ \frac{\partial {\gamma }}{\partial {\theta }}\left( \frac{n}{\gamma }-t\right) P^n \textbf{p}(0)\right) \end{aligned}$$
(9)

We use the recursions (6), (7) and additionally

$$\begin{aligned} \frac{\partial {P^{n}}}{\partial {\theta }}&= \frac{\partial {P}}{\partial {\theta }}P^{n-1}+ P \frac{\partial {P}}{\partial {\theta }} P^{n-2} + \ldots + P^{n-2} \frac{\partial {P}}{\partial {\theta }}P+ P^{n-1} \frac{\partial {P}}{\partial {\theta }} \nonumber \\&=\frac{\partial {P}}{\partial {\theta }}P^{n-1} + P \left( \frac{\partial {P}}{\partial {\theta }} P^{n-2} + \ldots + P^{n-3} \frac{\partial {P}}{\partial {\theta }}P+ P^{n-2} \frac{\partial {P}}{\partial {\theta }} \right) \nonumber \\&=\frac{\partial {P}}{\partial {\theta }}P^{n-1} + P \left( \frac{\partial {P^{n-1}}}{\partial {\theta }} \right) \end{aligned}$$
(10)

to compute \(\textbf{p}(t)':=\partial \textbf{p}(t)/\partial \theta \) according to Eq. (9) by algorithm 2.

Algorithm 1 Uniformization

Algorithm 2 Differentiated Uniformization

Applying the differentiated uniformization for a particular statistical model requires the scalar

$$\begin{aligned} \gamma \ge \underset{x}{\max }|Q_{x,x}| \quad \text { and its derivative}\quad \gamma ' := \frac{\partial {\gamma }}{\partial {\theta }}. \end{aligned}$$
(11)

A generic choice for \(\gamma \) can be the 2-norm of the diagonal of Q or any p-norm with even p. It also requires the operators

$$\begin{aligned} P = \frac{1}{\gamma }Q+{{\,\textrm{I}\,}}\quad \text { and }\quad P' := \frac{\partial {P}}{\partial {\theta }}&= -\frac{1}{\gamma ^2}\frac{\partial {\gamma }}{\partial {\theta }}Q + \frac{1}{\gamma }\frac{\partial {Q}}{\partial {\theta }}. \end{aligned}$$
(12)

Crucially, these operators are only needed for matrix–vector products in lines 11 and 12 of algorithm 2 and do not need to be stored explicitly. This makes our method especially useful for models where Q is large but has a compact representation as a sum of tensor products, which allows one to cheaply compute matrix–vector products (Buis and Dyksen 1996).
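The following Python sketch shows how Eqs. (9), (10) and (12) fit together when Q and \(\partial Q/\partial \theta \) are available only as matrix–vector products. It is an illustration under stated assumptions, not a transcription of Algorithm 2: the function name, the callback arguments `matvec_Q` and `matvec_dQ`, and the reuse of the stopping rule of Eq. (8) for the derivative series are choices made for this example.

```python
import math

def differentiated_uniformization(matvec_Q, matvec_dQ, gamma, dgamma, p0, t, eps=1e-10):
    """Return (p(t), dp(t)/dtheta) from the series in Eqs. (5) and (9).

    matvec_Q(v) and matvec_dQ(v) return Q v and (dQ/dtheta) v as lists;
    gamma and dgamma are the bound of Eq. (11) and its derivative.
    """
    def apply_P(v):   # P v with P = Q / gamma + I, Eq. (12)
        return [qv / gamma + vi for qv, vi in zip(matvec_Q(v), v)]

    def apply_dP(v):  # P' v with P' = -gamma'/gamma^2 Q + Q'/gamma, Eq. (12)
        return [-dgamma / gamma ** 2 * qv + dqv / gamma
                for qv, dqv in zip(matvec_Q(v), matvec_dQ(v))]

    weight = math.exp(-gamma * t)      # Poisson weight for n = 0
    q = list(p0)                       # q  holds P^n p0
    dq = [0.0] * len(p0)               # dq holds (dP^n/dtheta) p0, zero for n = 0
    p = [weight * v for v in q]
    dp = [weight * dgamma * (-t) * v for v in q]   # n = 0 term of Eq. (9)
    mass, n = weight, 0
    while 1.0 - mass > eps:
        n += 1
        # Eq. (10): uses the previous q, so dq is updated before q.
        dq = [a + b for a, b in zip(apply_dP(q), apply_P(dq))]
        q = apply_P(q)                 # Eq. (6)
        weight *= gamma * t / n        # Eq. (7)
        factor = dgamma * (n / gamma - t)
        p = [pi + weight * qi for pi, qi in zip(p, q)]
        dp = [di + weight * (dqi + factor * qi) for di, dqi, qi in zip(dp, dq, q)]
        mass += weight
    return p, dp
```

Only two vectors of the size of the state space are carried through the loop besides the results, which matches the observation that P and P' never have to be stored.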

Differentiated uniformization thus opens the door to parameter inference for CTMCs on huge discrete state spaces. Let \(\{x_1, \ldots , x_K\}\) be observations of the Markov chain at corresponding time points \(\{t_1, \ldots , t_K\}\). We represent each data point by an empirical probability distribution \(\varvec{\delta }(t_k) \in \mathbb {R}^{\vert X\vert }\), where \(\varvec{\delta } (t_k)_{x_k}=1\) and all other entries are zero. The likelihood of \(\theta \) for a single observation of state \(x_k\) at time \(t_k\) with \(k>1\) is

$$\begin{aligned} \textbf{p}(t_k)_{x_k} \text {, where }\textbf{p}(t_k) = \exp \,\big ((t_{k}-t_{k-1}) Q\big ) \varvec{\delta }(t_{k-1}). \end{aligned}$$
(13)

The log-likelihood for the whole data set,

$$\begin{aligned} \ell (\theta ) =\sum _{k=2}^{K}\log (\textbf{p}(t_k)_{x_k}), \end{aligned}$$
(14)

can be maximized using its derivative

$$\begin{aligned} \frac{\partial {\ell (\theta )}}{\partial {\theta }} =\sum _{k=2}^{K}\frac{\textbf{p}(t_k)_{x_k}'}{\textbf{p}(t_k)_{x_k}}, \end{aligned}$$
(15)

for example by gradient ascent. This derivative can also be used for sampling a posterior distribution of \(\theta \) in a full Bayesian model using a Hamiltonian Monte Carlo method (Gelman et al. 2013).
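Equations (13) to (15) translate into a short accumulation loop. The sketch below is illustrative only: `log_likelihood_and_gradient` and `make_two_state_transient` are hypothetical names, and the two-state chain with a closed-form transient distribution merely stands in for a call to differentiated uniformization so that the example stays self-contained.

```python
import math

def log_likelihood_and_gradient(states, times, transient):
    """Evaluate Eqs. (14) and (15) for a scalar parameter theta.

    transient(x_prev, dt) must return (p, dp): the distribution after time dt
    when starting from a point mass on x_prev, and its derivative w.r.t. theta.
    """
    ell, grad = 0.0, 0.0
    for k in range(1, len(states)):
        p, dp = transient(states[k - 1], times[k] - times[k - 1])  # Eq. (13)
        ell += math.log(p[states[k]])                               # Eq. (14)
        grad += dp[states[k]] / p[states[k]]                        # Eq. (15)
    return ell, grad

def make_two_state_transient(theta):
    """Hypothetical stand-in: jump 0 -> 1 with rate theta, state 1 is absorbing."""
    def transient(x_prev, dt):
        if x_prev == 1:
            return [0.0, 1.0], [0.0, 0.0]
        stay = math.exp(-theta * dt)
        return [stay, 1.0 - stay], [-dt * stay, dt * stay]
    return transient
```

The returned gradient can be passed to a gradient-ascent step or to a sampler that accepts user-supplied gradients.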

3 Modeling epidemic spread

The most basic models of epidemic spread are SIR models, which describe the numbers of susceptible (S), infected/infectious (I) and recovered (R) people during an epidemic in a closed population of constant size N.

The deterministic SIR model (Kermack and McKendrick 1927) assumes that \(S(t), I(t), R(t) \in [0,N]\) are continuous and describes their evolution over time \(t \in [0,\infty )\) by the following system of nonlinear ordinary differential equations:

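$$\begin{aligned} \frac{\textrm{d}S(t)}{\textrm{d}t}&= -\beta \frac{S(t)I(t)}{N}, \nonumber \\ \frac{\textrm{d}I(t)}{\textrm{d}t}&= \beta \frac{S(t)I(t)}{N} - \alpha I(t), \nonumber \\ \frac{\textrm{d}R(t)}{\textrm{d}t}&= \alpha I(t), \end{aligned}$$
(16)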

where \(\alpha , \beta \in \mathbb {R}^+\) are parameters. Note that once S(t) and I(t) are given, \(R(t)=N-S(t)-I(t)\) is already determined and can be omitted in further analysis.

In words, an infection occurs when a susceptible person comes in sufficiently close contact with an infected person, which happens proportionally to the number of susceptible and to the density of infected people in the population and proportionally to an infection rate \(\beta \). This rate \(\beta \) encompasses, for example, disease characteristics, people’s behavior, public policy and weather. An infected person recovers with rate \(\alpha \) and can then no longer become susceptible or infected again. The basic reproduction number \(\mathcal {R}_0:= \beta / \alpha \) is the number of people (in a fully susceptible population) that one infected person infects before recovering.

There is no analytical solution to system (16), but it can be solved numerically, for example by Euler’s method:

$$\begin{aligned} S(t+\Delta t)&= S(t) - \beta \frac{S(t)I(t)}{N} \Delta t, \nonumber \\ I(t+\Delta t)&= I(t) + \beta \frac{S(t)I(t)}{N} \Delta t - \alpha I(t) \Delta t. \end{aligned}$$
(17)

The black curve in Fig. 1a illustrates this solution for given parameters \(\alpha =1w^{-1}\), \(\beta =2.5w^{-1}\) and initial conditions \(N=500\), \(I(0)=3\), \(S(0)=497\).
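The Euler update of Eq. (17) can be written directly as a loop. The sketch below is a minimal illustration; the function name `euler_sir` and the step size used in the usage note are assumptions, since the text does not state which step size was used for Fig. 1a.

```python
def euler_sir(S0, I0, N, alpha, beta, dt, steps):
    """Integrate the deterministic SIR model with the explicit Euler scheme of Eq. (17).

    Returns a list of (t, S, I) tuples; R can be recovered as N - S - I.
    """
    S, I = float(S0), float(I0)
    trajectory = [(0.0, S, I)]
    for k in range(1, steps + 1):
        new_infections = beta * S * I / N * dt
        new_recoveries = alpha * I * dt
        # Both updates use the values of S and I from the previous step.
        S, I = S - new_infections, I + new_infections - new_recoveries
        trajectory.append((k * dt, S, I))
    return trajectory
```

For example, `euler_sir(497, 3, 500, alpha=1.0, beta=2.5, dt=0.01, steps=2000)` covers 20 weeks with the parameters quoted above.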

Fig. 1 Illustration of SIR models with \(N=500\), \(\alpha =1w^{-1}\), \(\beta =2.5w^{-1}\), \(I(0)=3\), \(S(0)=497\)

This deterministic model has several limitations. First, an epidemic is in fact not a deterministic dynamical system but a stochastic process that depends on the random behavior of people and random duration of each infection. A person does not recover after exactly one week, but only after one week on average. Especially if the very first infected people recover by chance before they come in contact with other people, the epidemic may not even take off (flat blue curves in Fig. 1a). Also, a person does not infect exactly 2.5 people per week, but only 2.5 people per week on average. Whether a person infected early on happens to come in close contact with someone else after one week or two weeks may shift the whole course of the epidemic (red curve in Fig. 1a). Hence stochastic fluctuations especially at the beginning of the epidemic can drastically alter the shape of the curve compared to its deterministic counterpart. Only by considering the uncertainty in the course of the epidemic can policy makers make informed decisions, e.g., for allocating limited hospital capacities over time.

Another limitation of the deterministic model is that, without modeling the stochasticity explicitly, it is not possible to quantify the uncertainties of inferred parameters, which in turn contribute to the uncertainty in the course of the epidemic.

These limitations are alleviated by the stochastic SIR model (McKendrick 1925; Allen 2017) which is a continuous-time Markov chain over all possible states of the population. A state is a pair of integers \((S,I) \in \{ 0, \dots , N \} \times \{ 0, \dots , N \}\) denoting the number of susceptible and infected people. Because of the very large number of possible states the stochastic SIR model is more challenging and less widely adopted than the deterministic model.

Let \(\textbf{p}(t)\in \mathbb {R}^{(N+1)^2}\) denote the probability distribution at time t over all states \((S,I)\). That is, \(\textbf{p}(t)_{(S,I)}\) is the probability that at time t there are S susceptible and I infected people. Its time evolution is governed by the Kolmogorov forward equation

$$\begin{aligned} \frac{\textrm{d}\textbf{p}(t)}{\textrm{d}t} = Q \textbf{p}(t), \end{aligned}$$
(18)

where the matrix \(Q \in \mathbb {R}^{(N+1)^2 \times (N+1)^2}\) contains the transition rates from a state \((S,I)\) to a state \((S+\Delta S, I+\Delta I)\):

$$\begin{aligned} Q_{(S+\Delta S, I+\Delta I),(S,I)} = {\left\{ \begin{array}{ll} \beta \frac{S I}{N} &{} \text{ if } \Delta S=-1, \Delta I=+1, \\ \alpha I &{} \text{ if } \Delta S= 0, \Delta I=-1, \\ -\beta \frac{S I}{N}-\alpha I &{} \text{ if } \Delta S= 0, \Delta I= 0, S\ne 0, I \ne N, \\ -\alpha I &{} \text{ if } \Delta S= 0, \Delta I= 0, S=0 \text { or } I=N, \\ 0 &{} \text{ otherwise }.\\ \end{array}\right. } \end{aligned}$$
(19)

The blue and red curves in Fig. 1a depict 10 randomly sampled trajectories where transitions happen according to the rates in Eq. (19), generated by the Gillespie (1976) algorithm. Figure 1b shows the analytic solution to Eq. (18) and further illustrates that the stochasticity is not merely additive noise around the deterministic solution.
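A single trajectory with the rates of Eq. (19) can be sampled as follows. This is a minimal sketch of the direct-method Gillespie simulation; the function name `gillespie_sir` and the explicit `rng` argument are assumptions for the example and do not reproduce the code behind Fig. 1a.

```python
import random

def gillespie_sir(S0, I0, N, alpha, beta, t_max, rng):
    """Sample one trajectory of the stochastic SIR model up to time t_max.

    Returns a list of (t, S, I) tuples, one per event plus the initial state.
    """
    t, S, I = 0.0, S0, I0
    path = [(t, S, I)]
    while I > 0:                            # no further events once I reaches 0
        rate_infection = beta * S * I / N   # transition (S, I) -> (S - 1, I + 1)
        rate_recovery = alpha * I           # transition (S, I) -> (S, I - 1)
        total = rate_infection + rate_recovery
        t += rng.expovariate(total)         # exponential waiting time
        if t > t_max:
            break
        if rng.random() * total < rate_infection:
            S, I = S - 1, I + 1
        else:
            I -= 1
        path.append((t, S, I))
    return path
```

Calling `gillespie_sir(497, 3, 500, 1.0, 2.5, 20.0, random.Random(0))` repeatedly with different seeds yields a mix of trajectories that die out early and trajectories that take off, as described above.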

The parameters \(\alpha , \beta \in \mathbb {R}^+\) can be inferred from data using differentiated uniformization. This requires multiple matrix–vector products with Q which is, however, too large to be stored explicitly, even for populations of only thousands of people. Hence, we propose a novel representation of Q that does not require explicit storage. To this end, we introduce band matrices of size \({(N+1) \times (N+1)}\):

$$\begin{aligned} \begin{aligned} \mathcal {S}^+_{\text {inf}}&= \text {superdiag}(1, \dots , N),\\ \mathcal {S}^-_{\text {inf}}&= \text {diag}(0, \dots , N),\\ \mathcal {S}^+_{\text {rec}}&= \text {diag}(1, 1, \dots , 1)={{\,\textrm{I}\,}},\\ \mathcal {S}^-_{\text {rec}}&= \text {diag}(1, 1, \dots , 1)={{\,\textrm{I}\,}}, \end{aligned}&\qquad \qquad \begin{aligned} \mathcal {I}^+_{\text {inf}}&= \text {subdiag}(0, \dots , N-1),\\ \mathcal {I}^-_{\text {inf}}&= \text {diag}(0, \dots , N-1, 0),\\ \mathcal {I}^+_{\text {rec}}&= \text {superdiag}(1, \dots , N),\\ \mathcal {I}^-_{\text {rec}}&= \text {diag}(0, \dots , N). \end{aligned} \end{aligned}$$
(20)

This yields a representation of the transition-rate matrix

$$\begin{aligned} Q = \frac{\beta }{N} (\mathcal {S}^+_{\text {inf}} \otimes \mathcal {I}^+_{\text {inf}}) + \alpha (\mathcal {S}^+_{\text {rec}} \otimes \mathcal {I}^+_{\text {rec}}) - \frac{\beta }{N} (\mathcal {S}^-_{\text {inf}} \otimes \mathcal {I}^-_{\text {inf}}) - \alpha (\mathcal {S}^-_{\text {rec}} \otimes \mathcal {I}^-_{\text {rec}}) \end{aligned}$$
(21)

as a sum of tensor products (see Fig. 2 for an illustrated explanation). Note that Eq. (21) is not an approximation but an exact reformulation of Eq. (19). The benefit of this representation is that its storage complexity is \(\mathcal {O}(N)\) rather than \(\mathcal {O}(N^4)\) and that a matrix–vector product costs only \(\mathcal {O}(N^2)\) (Buis and Dyksen 1996) rather than \(\mathcal {O}(N^4)\).
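To make the matrix-free product concrete, the sketch below applies Eq. (21) to a distribution stored as a two-dimensional array `V[s][i]` over states \((S,I)=(s,i)\). Each band matrix of Eq. (20) is stored as a shift and a weight per column. The helper names `band`, `kron_apply` and `sir_matvec` are hypothetical, and the layout is one possible choice rather than the authors' implementation.

```python
def band(shift, weights):
    """Band matrix M with M[c + shift][c] = weights[c] and zeros elsewhere."""
    return shift, list(weights)

def kron_apply(A, B, V):
    """Apply the tensor product of band matrices A and B to V without forming it."""
    (shift_a, w_a), (shift_b, w_b) = A, B
    n = len(V)
    out = [[0.0] * n for _ in range(n)]
    for s in range(n):
        for i in range(n):
            r, c = s + shift_a, i + shift_b
            if 0 <= r < n and 0 <= c < n:   # entries shifted out of range do not exist
                out[r][c] += w_a[s] * w_b[i] * V[s][i]
    return out

def sir_matvec(V, N, alpha, beta):
    """Compute Q V for the SIR rate matrix via the four terms of Eq. (21)."""
    counts = range(N + 1)                            # (0, ..., N)
    capped = [i if i < N else 0 for i in counts]     # (0, ..., N-1, 0)
    ones = [1.0] * (N + 1)
    terms = [
        (beta / N, band(-1, counts), band(+1, capped)),   # S+_inf, I+_inf
        (alpha, band(0, ones), band(-1, counts)),         # S+_rec, I+_rec
        (-beta / N, band(0, counts), band(0, capped)),    # S-_inf, I-_inf
        (-alpha, band(0, ones), band(0, counts)),         # S-_rec, I-_rec
    ]
    out = [[0.0] * (N + 1) for _ in range(N + 1)]
    for coefficient, A, B in terms:
        T = kron_apply(A, B, V)
        for s in range(N + 1):
            for i in range(N + 1):
                out[s][i] += coefficient * T[s][i]
    return out
```

Each of the four terms touches every entry of `V` once, so one product costs \(\mathcal {O}(N^2)\) operations while only \(\mathcal {O}(N)\) numbers describe Q.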

Fig. 2 Illustration of Q for a population of size \(N=3\) given by its entry-wise representation in Eq. (19) (top) and its tensor representation in Eq. (21) (bottom). Blue numbers indicate susceptibles, red numbers indicate infected and blank entries in the matrices are zero. Transitions should be read from columns to rows. \(\mathcal {S}^+_{\text {inf}}\): An infection decreases the number of susceptibles by one and happens proportionally to the current number of susceptibles. \(\mathcal {I}^+_{\text {inf}}\): At the same time, an infection increases the number of infected by one and happens proportionally to the current number of infected. The tensor product \(\otimes \) combines both these transitions for a single infection. Moreover, an infection happens inversely proportionally to the total population size N and proportionally to the parameter \(\beta \). \(\mathcal {S}^+_{\text {rec}}\): A recovery does not change the number of susceptibles. \(\mathcal {I}^+_{\text {rec}}\): At the same time, a recovery decreases the number of infected by one and happens proportionally to the current number of infected. The tensor product \(\otimes \) combines both these transitions for a single recovery. Moreover, a recovery happens proportionally to the parameter \(\alpha \). The matrices \(\mathcal {S}^-_{\text {inf}}\), \(\mathcal {I}^-_{\text {inf}}\), \(\mathcal {S}^-_{\text {rec}}\), \(\mathcal {I}^-_{\text {rec}}\) generate corresponding negative entries for the diagonal of Q

Additionally, differentiated uniformization requires the derivative \(\partial Q/\partial \theta \). Here we perform inference with respect to logarithmic parameters \(\theta = (\log \alpha , \log \beta )\) in order to ensure the positivity constraint on \(\alpha \) and \(\beta \):

$$\begin{aligned} \frac{\partial Q}{\partial \log \alpha }&= \alpha (\mathcal {S}^+_{\text {rec}} \otimes \mathcal {I}^+_{\text {rec}}) - \alpha (\mathcal {S}^-_{\text {rec}} \otimes \mathcal {I}^-_{\text {rec}}), \end{aligned}$$
(22)
$$\begin{aligned} \frac{\partial Q}{\partial \log \beta }&= \frac{\beta }{N} (\mathcal {S}^+_{\text {inf}} \otimes \mathcal {I}^+_{\text {inf}}) - \frac{\beta }{N} (\mathcal {S}^-_{\text {inf}} \otimes \mathcal {I}^-_{\text {inf}}). \end{aligned}$$
(23)

Finally, differentiated uniformization requires a differentiable upper bound \(\gamma \) on the absolute diagonal entries of Q. For the SIR model we choose the exact maximum

$$\begin{aligned} \gamma = \underset{x}{\max }|Q_{x,x}|&= \max \left\{ |Q_{(N,N-1),(N,N-1)}|, |Q_{(N,N),(N,N)}| \right\} \nonumber \\&= \max \left\{ N(N-1)\frac{\beta }{N}+(N-1)\alpha , N\alpha \right\} \nonumber \\&= \max \left\{ (N-1)\beta +(N-1)\alpha , \alpha +(N-1)\alpha \right\} \nonumber \\&= (N-1)\alpha + \max \{(N-1)\beta , \alpha \}. \end{aligned}$$
(24)

It is differentiable for \(\alpha \ne (N-1)\beta \) with

$$\begin{aligned} \frac{\partial \gamma }{\partial \log \alpha }&= {\left\{ \begin{array}{ll} N\alpha &{} \text{ if } \alpha > (N-1)\beta ,\\ (N-1)\alpha &{} \text{ if } \alpha < (N-1)\beta , \end{array}\right. }\end{aligned}$$
(25)
$$\begin{aligned} \frac{\partial \gamma }{\partial \log \beta }&= {\left\{ \begin{array}{ll} 0 &{} \text{ if } \alpha > (N-1)\beta ,\\ (N-1)\beta &{} \text{ if } \alpha < (N-1)\beta . \end{array}\right. } \end{aligned}$$
(26)
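Equations (24) to (26) amount to a small case distinction. The sketch below evaluates them for logarithmic parameters; the function name `sir_gamma` is hypothetical, and the behavior at the non-differentiable tie \(\alpha = (N-1)\beta \) (falling into the second branch) is an arbitrary choice made for the example.

```python
import math

def sir_gamma(log_alpha, log_beta, N):
    """Return (gamma, dgamma/dlog(alpha), dgamma/dlog(beta)) from Eqs. (24)-(26)."""
    alpha, beta = math.exp(log_alpha), math.exp(log_beta)
    infection_part = (N - 1) * beta
    gamma = (N - 1) * alpha + max(infection_part, alpha)   # Eq. (24)
    if alpha > infection_part:
        return gamma, N * alpha, 0.0                       # first cases of Eqs. (25), (26)
    # Covers alpha < (N-1) beta; the tie is assigned here by convention.
    return gamma, (N - 1) * alpha, infection_part
```

With \(N=500\), \(\alpha =1\) and \(\beta =2.5\) the infection term dominates, so \(\gamma = 499 + 1247.5 = 1746.5\).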

Overall, differentiated uniformization performs \(\mathcal {O}(\gamma )\) matrix–vector products and thus has a total runtime complexity in \(\mathcal {O}(\gamma N^2)=\mathcal {O}(N^3)\) for the SIR model. It requires storage of the result \(\textbf{p}(t)\), which has complexity \(\mathcal {O}(N^2)\).

For parameter inference we are typically only interested in the likelihood that an earlier data point \((S,I)\) is followed by a later data point \((S+\Delta S, I+\Delta I)\) after time t. Since the number of susceptibles cannot increase (\(\Delta S \le 0\)) and the number of recovered cannot decrease (\(\Delta R = -\Delta S - \Delta I \ge 0\)) along a trajectory, it is sufficient to compute \(\textbf{p}(t)\) and \(\textbf{p}(t)'\) on the restricted state space

$$\begin{aligned} \{ S+\Delta S, \dots , S \} \times \{ I -\Delta R, \dots , I-\Delta S \}, \end{aligned}$$

as explained in Appendix A. Following Ho et al. (2018) we use this state-space restriction to reduce the time complexity of our algorithm to \(\mathcal {O}\big ((I+\vert \Delta S\vert )(\Delta S^2 + \vert \Delta S\vert \Delta R)\big )\) and its storage complexity to \(\mathcal {O}(\Delta S^2 + \vert \Delta S\vert \Delta R)\).

4 COVID-19 pandemic

Here we model the first wave of the COVID-19 pandemic in Austria as a stochastic SIR model. We employ differentiated uniformization to estimate the parameters \(\alpha \) and \(\beta \) and quantify their uncertainty. We use daily numbers on S, I and R between 2020-03-01 and 2020-09-01 from public health data provided by the Austrian Bundesministerium für Soziales (2021) (Fig. 4). I and R are given directly, and we set \(S=N-I-R\) assuming that the initial population size \(N=8{,}932{,}664\) stays constant. People who have died from COVID-19 are counted under “recovered” in a technical sense as they are no longer infectious. We do not correct for undiscovered cases and biases in testing and reporting. We also assume that parameters are piecewise constant for each month.

We perform a full Bayesian analysis for parameter pairs \((\log \alpha ,\log \beta )\) with a uniform prior for each parameter in the interval between \(\log (0.01/\text {day})\) and \(\log (1/\text {day})\). This highlights the shape of the likelihood of the model but can be substituted by any other prior informed by expert knowledge. Following Ho et al. (2018) we sample from the joint posterior using a Hamiltonian Monte Carlo (HMC) scheme (Duane et al. 1987; Neal 2011) as implemented in the software package PyMC (Oriol et al. 2023). Unlike a standard Metropolis-Hastings scheme, HMC makes use of the gradient of the likelihood, which we compute using differentiated uniformization. This makes sampling more efficient, with fewer samples needed to cover the posterior distribution (Gelman et al. 2013).

We estimated the joint posterior of \((\log \alpha ,\log \beta )\) for every month between March 2020 and August 2020 separately. For each month we performed 10 parallel Monte Carlo chains with length 1000, where we discarded the first 100 points each, resulting in 9000 points per month. These calculations were done on the QPACE 3 cluster (Georg et al. 2018). For each posterior we recorded the runtime (averaged over the 10 chains) and measured the marginal folded effective sample sizes (ESS) (Vehtari et al. 2021) for \(\alpha \) and \(\beta \), see Table 1.

Table 1 Diagnostics of HMC sampling
Table 2 Diagnostics of MH sampling

Figure 5 shows the results of this analysis. The estimated posterior of \((\alpha ,\beta )\) is plotted on logarithmic scales. The gray shaded areas were generated using Gaussian-kernel density estimation applied to the posterior samples. The crosses mark the least-squares estimators of the corresponding deterministic SIR models. The dashed lines represent parameter constellations where \(\alpha =\beta \) and thus \(\mathcal {R}_0=1\). Here the epidemic switches between growing and decreasing numbers of infected. From April to August 2020 the posterior of the recovery rate \(\alpha \) varies around a value of 0.07 per day, corresponding to a realistic mean time to recovery of about 2 weeks (Faes et al. 2020). In contrast, the posterior of \(\alpha \) in March 2020 appears to be off, with a mean of about 0.03 per day, corresponding to a mean time to recovery of about one month. Inspecting the original numbers, we observed that the numbers of recovered are unexpectedly low (fewer than 100 people until 2020-03-23), possibly due to lagging declaration of recoveries because of cautious hospital policies at the beginning of the pandemic.

Fig. 3 Marginal trace plots of \(\alpha \) and \(\beta \) for May in a single chain for the HMC sampling and MH sampling. The first 100 sample points are discarded as burn-in and shown in grey

Finally, we compared the HMC sampling to a random-walk Metropolis-Hastings (MH) sampling, see Table 2. We evaluated both methods on the same hardware and used their respective implementations in PyMC with 100 tuning iterations for their hyperparameters. As expected, MH required a much lower runtime for a fixed total sample size. However, the effective sample sizes were substantially larger for HMC, such that HMC outperformed MH in terms of ESS per runtime. Figure 3 shows the marginal trace plots of \(\alpha \) and \(\beta \) for May in a single chain for each of the two samplers. Trace plots for the other months and autocorrelation plots are available in the supplement.

Fig. 4 Daily reported numbers of people infected by and recovered from SARS-CoV-2 in Austria

Fig. 5 Posterior probability densities over parameter pairs \((\alpha ,\beta )\) for separate stochastic SIR models of the first six months of the COVID-19 pandemic in Austria. The dashed lines indicate parameters where the basic reproduction number \(\mathcal {R}_0=\beta /\alpha =1\). The crosses mark the least-squares estimators of the corresponding deterministic SIR models

5 Discussion

We provide a novel method for computing the transient distribution and its derivative for continuous-time Markov chains on huge discrete state spaces. This makes parameter inference tractable for a large family of statistical models, including the stochastic SIR model of epidemic spread.

Our key observation is that the transition-rate matrix of an SIR model can be written as a sum of tensor products, which allows us to cheaply compute matrix–vector products without storing the matrix itself. This operation alone is sufficient to compute the transient distribution by the uniformization method (Grassmann 1977), a numerically stable power-series expansion of the matrix exponential. We propose the differentiated uniformization method, an analogous power series for computing the derivative of the transient distribution with respect to parameters of a CTMC.

For the SIR model our algorithm scales cubically in the size of the population, which is one order slower than the state-of-the-art method for multivariate birth processes (Ho et al. 2018). On the other hand, our general-purpose algorithm also applies to birth-death processes such as predator–prey dynamics (Owen et al. 2014), which have been considered intractable so far (Ho et al. 2017). We illustrate this in Appendix B. In general our algorithm is applicable to any CTMC of interacting discrete variables. It scales exponentially in the number of variables but polynomially in the size of each variable’s state space. These variables could be additional compartments in an epidemic model, such as the number of exposed but not yet infected people, asymptomatic carriers or deceased people.

Beyond epidemiology we are interested in tumor progression modeling using mutual hazard networks (Schill et al. 2019). Similar to an epidemiological model that scales exponentially in the number of compartments, a tumor progression model is a CTMC that scales exponentially in the number of possible mutations. While both differentiated uniformization and the algorithm of Ho et al. (2018) have the potential to advance this field, large scale inference remains an open problem for tumor progression models with up to hundreds of mutations. The tensor representation of the transition-rate matrix could serve as a starting point for representing the transient distribution itself in a low-rank tensor format. These formats reduce the exponential cost (e.g., in the number of mutations or compartments) to linear cost provided certain low-rank structures exist (Hackbusch 2012). For large-scale CTMCs, low-rank tensor formats were already successfully used, e.g., for the computation of transient (Johnson et al. 2010) and stationary distributions (Benson et al. 2017; Buchholz et al. 2016; Kressner and Macedo 2014) and also for a variant of the uniformization method (Georg et al. 2020). Therefore, the combination of low-rank tensor formats and differentiated uniformization could be a promising new avenue for large-scale inference problems in computational oncology and epidemiology.

From this perspective our work can also be seen as an attempt to connect these two communities.