1 Introduction

1.1 Motivation

A notion of distance between probability distributions is required in many fields, e.g., information and probability theory, approximate Bayesian computation (Csilléry et al. 2010; Schill et al. 2019), spectral (re-)construction (Hoch 2014; Rothkopf 2017; Ha et al. 2019), or compression using variational autoencoders (Kingma and Welling 2022). In mathematical terms, this notion can be phrased in terms of divergence measures.

One of the most commonly used divergence measures is the Kullback–Leibler (KL) divergence (Kullback and Leibler 1951), which for two discrete probability vectors \(\textbf{p}=(p_1,\ldots ,p_n)\) and \(\textbf{q}=(q_1,\ldots ,q_n)\) has the form

$$\begin{aligned} D_{KL }({\textbf {p}}\Vert {\textbf {q}})=\sum _{i=1}^n p_i\log \frac{p_i}{q_i}\,. \end{aligned}$$
(1)

The KL divergence includes the logarithm, which is well-defined in the usual case of probability vectors with nonnegative entries. However, in practical applications, situations can arise that lead to negative entries, for which the logarithm is not defined. Let us give a few examples. In high-dimensional spaces, approximations are often needed to keep calculations tractable and to avoid runtime explosion due to the curse of dimensionality (Thomas and Grima 2015; Georg et al. 2022). Such approximations can lead to negative entries when the exact entry is close to zero, as is frequently the case for probability vectors. Another source of negative entries are rounding errors that occur, e.g., when the probability vector is obtained as the solution of a linear system of equations (Philippe et al. 1992; Schill et al. 2019). Also, assumptions of a theory can lead to negative entries in the approximate probability distribution (Alkofer and Smekal 2001; Burnier et al. 2011; Rothkopf 2017).

If negative entries arise, one is faced with the choice of (a) giving up, (b) reformulating the theory, model, or approximation such that negative entries cannot occur, or (c) devising a workaround that leads to correct results in a suitable limit. In Sect. 1.2 we review approaches from categories (b) and (c) to address the problem of negative entries (Paatero and Tapper 1994; Hobson and Lasenby 1998; Welling and Weber 2001; Chi and Kolda 2012; Haas et al. 2014; Hansen et al. 2015; Lee et al. 2016; Rothkopf 2017). The methods we are aware of are typically tailored to the problem at hand or lead to reduced run-time performance.

Here, we suggest a method in category (c) that is generically applicable and does not suffer from decreased performance. Specifically, we propose the shifted Kullback–Leibler (sKL) divergence as a substitute divergence measure for the standard KL divergence in the case of approximate probability vectors with negative entries. It is parameterized by a vector of shift parameters. The sKL divergence is a modification of the KL divergence and retains many of the latter’s useful properties while handling negative entries when they arise. To give theoretical support to our method, we consider a simple example in which i.i.d. Gaussian noise is added to the entries \(q_i\) in Eq. (1) so that negative entries can occur. For this example we prove that the difference between the KL divergence and the expected sKL divergence under the noise is quadratic in the standard deviation of the noise, provided that the shift parameters are suitably chosen. Therefore the sKL divergence converges to the KL divergence in the limit of small i.i.d. Gaussian noise. We also show this convergence numerically.

In a concrete application, we show that the sKL divergence enables efficient learning of a Mutual Hazard Network (MHN), a model of cancer progression as an accumulation of certain genetic events (Schill et al. 2019; Chen 2023; Luo et al. 2023). In an MHN, cancer progression is modeled as a continuous-time Markov chain whose transition rates are parameters of the model that are learned such that the model explains a given patient-data distribution. When considering \(\gtrsim 25\) possible genetic events, exact model construction becomes impractical and requires efficient approximation techniques (Georg et al. 2022). We show that even though negative entries occur in the approximate probability vectors, meaningful models can still be constructed when the sKL divergence is employed.

1.2 Related work

Various approaches to handling or avoiding negative entries in approximate probability distributions exist in the literature.

In earlier work (Georg 2022), we considered introducing a threshold \(\varepsilon > 0\) and replacing entries of the probability vector that are smaller than this threshold by \(\varepsilon \). This method fails to preserve probability mass and leads to a loss of gradient information when used in an optimization task. Additionally, the comparison function obtained in this way no longer meets the requirements of a statistical divergence (Amari 2016), i.e., none of the required properties in Theorem 1 below are satisfied in this case.

For high-dimensional probability distributions, several nonnegative tensor decompositions have been proposed in order to break the curse of dimensionality while avoiding negative entries (Paatero and Tapper 1994; Welling and Weber 2001; Chi and Kolda 2012; Hansen et al. 2015; Lee et al. 2016). Typically, this change of format leads to a significant increase in the run time of the algorithms involved, as other important properties for a time-efficient approximation have to be omitted.

In situations where negative entries occur due to assumptions of the theory, a common approach is to modify the divergence measure (Hobson and Lasenby 1998; Rothkopf 2017; Haas et al. 2014). Such modifications are typically problem-specific and therefore not applicable in general.

Our approach is in a similar spirit, but our modification of an established divergence measure is not problem-specific, as it does not depend on special properties of a particular application. Furthermore, since we do not need specific formats to preserve nonnegativity, optimization tasks in high dimensions can be completed efficiently.

This paper is structured as follows. In Sect. 2 we define the sKL divergence, discuss its properties and give suggestions for how to choose the shift parameters. In Sect. 3 we show how the sKL divergence can be applied in optimization tasks on approximate probability vectors. In Sect. 4 we summarize our findings and give an outlook on future research topics. In Sects. A and B we provide proofs of three theorems on the sKL divergence and mathematical details of the approximation we employ.

2 Theory

2.1 The sKL divergence

For two probability vectors \(\textbf{p}\) and \(\textbf{q}\) on a finite set of n elements, the Kullback–Leibler (KL) divergence of \(\textbf{p}\) from \(\textbf{q}\) is defined as (Kullback and Leibler 1951)

$$\begin{aligned} D_{KL }({\textbf {p}}\Vert {\textbf {q}})=\sum _{i=1}^n p_i\log \frac{p_i}{q_i} \end{aligned}$$

with the convention \(0 \log (0/x)=0\). In the context of statistical modeling, the vector \(\textbf{p}\) (with \(p_i\ge 0\)) is typically given by the data, while the vector \(\textbf{q}\) is a theoretical model. This model can be fitted to the data by minimizing the KL divergence of \(\textbf{p}\) from \(\textbf{q}\). Approximations in the calculation of \(\textbf{q}\) can lead to negative entries \(q_i<0\). In this case, the KL divergence is no longer defined and the optimization cannot be done. For this scenario, we propose the shifted Kullback–Leibler (sKL) divergence, a modification of the Kullback–Leibler divergence that allows for negative entries in \(\textbf{q}\). For a parameter vector \(\varvec{\upvarepsilon }\in \mathbb {R}_{\ge 0}^n\), we define it as

$$\begin{aligned} D_{sKL }({\textbf {p}}\Vert {\textbf {q}})=\sum _{i=1}^n \left( p_i+\varepsilon _i\right) \log \frac{p_i+\varepsilon _i}{q_i+\varepsilon _i} \end{aligned}$$
(2)
with the same convention as above. The sKL divergence is well defined for \(q_i>-\varepsilon _i\). It is a modification of the KL divergence that reduces to the KL divergence for \(\varvec{\upvarepsilon }=0\). Figure 1 shows the behavior of both the KL and the sKL divergence for probability vectors on a set with two possible outcomes. We can see that the domain on which the sKL divergence is defined now also includes approximate probability vectors with small negative entries. In the following two subsections we show that the sKL divergence retains many important properties of the KL divergence and discuss how to best choose the parameters \(\varepsilon _i\).

Fig. 1 Plot of the KL and sKL divergence of probability vector \(\textbf{p}=\left( 0.2, 0.8\right) \) from \(\textbf{q}=\left( q_1, 1-q_1\right) \). For the sKL divergence, \(\varepsilon _i=0.05\) was chosen for \(i=1,2\)
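To make Eq. (2) and the convention above concrete, here is a minimal NumPy sketch; the function name, the masking of zero terms, and the example value of \(q_1\) are our own choices for illustration, not part of any reference implementation.

```python
import numpy as np

def skl_divergence(p, q, eps):
    """Shifted KL divergence of p from q with shift vector eps, cf. Eq. (2).

    Terms with p_i + eps_i = 0 are dropped, implementing the convention
    0*log(0/x) = 0; all remaining terms require q_i + eps_i > 0.
    """
    p, q, eps = (np.asarray(a, dtype=float) for a in (p, q, eps))
    mask = (p + eps) > 0
    num, den = p[mask] + eps[mask], q[mask] + eps[mask]
    if np.any(den <= 0):
        raise ValueError("sKL divergence requires q_i > -eps_i")
    return float(np.sum(num * np.log(num / den)))

# Setting of Fig. 1: p = (0.2, 0.8), q = (q_1, 1 - q_1), eps_i = 0.05,
# here with a hypothetical small negative entry q_1 = -0.02.
p = np.array([0.2, 0.8])
q = np.array([-0.02, 1.02])
print(skl_divergence(p, q, np.full(2, 0.05)))
```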

2.2 Properties

The KL divergence belongs to the class of f-divergences (Basseville 2013). While the sKL divergence does not, it shares many important properties with the KL divergence: it is positive semidefinite, zero only for \(\textbf{p}=\textbf{q}\), and locally a metric. Thus, it still satisfies the definition of a statistical divergence, see Theorem 1.

Theorem 1

Let \(\textbf{p}\), \(\textbf{q}\) and \(\varvec{\upvarepsilon }\) be three vectors in \(\mathbb {R}^n\) whose entries satisfy \(p_i,q_i>-\varepsilon _i\) and \(\sum _ip_i=\sum _iq_i\). The sKL divergence of \(\textbf{p}\) from \(\textbf{q}\), with parameter vector \(\varvec{\upvarepsilon }\), is a statistical divergence, i.e., it satisfies the following properties (Amari 2016),

  1. \(D_{sKL }(\textbf{p}\Vert \textbf{q})\ge 0\),

  2. \(D_{sKL }(\textbf{p}\Vert \textbf{q})=0\) if and only if \(\textbf{p}=\textbf{q}\),

  3. \(\frac{\mathrm{d} ^2D_{sKL }(\textbf{p}\Vert \textbf{q})}{\mathrm{d} q_i\,\mathrm{d} q_j}\big |_{\textbf{q}=\textbf{p}}\) is a positive definite matrix.

The first assumption of the theorem can always be satisfied by a suitable choice of the shift parameters \(\varepsilon _i\), while the second assumption can always be satisfied by a rescaling of the \(q_i\). Additionally, we show in Theorem 2 that the sKL divergence is convex in the pair of its arguments, like the KL divergence (van Erven and Harremos 2014).

Theorem 2

For fixed parameter vector \(\varvec{\upvarepsilon }\), the sKL divergence is convex in the pair of its arguments. That is, if \((\textbf{p}^{(1)}, \textbf{q}^{(1)})\) and \((\textbf{p}^{(2)}, \textbf{q}^{(2)})\) are two pairs of vectors in \(\mathbb {R}^n\) for which \(D_{sKL }(\textbf{p}^{(1)}\Vert \textbf{q}^{(1)})\) and \(D_{sKL }(\textbf{p}^{(2)}\Vert \textbf{q}^{(2)})\) are well-defined, and if \(\lambda \in [0, 1]\), it satisfies

$$\begin{aligned}&D_{sKL }(\lambda \textbf{p}^{(1)}+(1-\lambda )\textbf{p}^{(2)}\Vert \lambda \textbf{q}^{(1)}+(1-\lambda )\textbf{q}^{(2)}) \\ \quad&\le \lambda D_{sKL }(\textbf{p}^{(1)}\Vert \textbf{q}^{(1)}) + (1-\lambda ) D_{sKL }(\textbf{p}^{(2)}\Vert \textbf{q}^{(2)})\,. \end{aligned}$$

This property is of particular importance as it is often needed for well-behaved optimization (Fletcher 2000). The proofs of the two theorems are given in Sect. A.

2.3 Parameter choice

In the sKL divergence, we introduced a parameter vector \(\varvec{\upvarepsilon }=(\varepsilon _1,\ldots ,\varepsilon _n)\). The properties of the sKL divergence discussed in the previous subsection (Theorems 1 and 2) hold for any choice of the shift parameters \(\varepsilon _i\). However, in applications, their values can have a large influence on the quality of the results, such as the speed of convergence or the accuracy of the learned model. In this section we therefore aim to provide a guide for choosing suitable values of the parameters \(\varepsilon _i\) for the case that \(\textbf{p}\) is an exact probability vector and \(\textbf{q}\) is an approximate one.

In the KL divergence, the terms in the sum only need to be evaluated for the indices i for which \(p_i\ne 0\). This property greatly reduces the computational cost if many entries of \(\textbf{p}\) are zero, a situation encountered, for example, when optimizing an MHN (Schill et al. 2019). In order to preserve this advantage when working with the sKL divergence, one can simply choose \(\varepsilon _i=0\) if \(p_i=0\).

To constrain the possible values of \(\varepsilon _i\), we first note that if \(q_i<0\), the sKL divergence is only well-defined if one chooses \(\varepsilon _i>-q_i\). Second, we compute the gradient of the sKL divergence,

$$\begin{aligned} \frac{\mathrm{d} D_{sKL }({\textbf {p}}\Vert {\textbf {q}})}{\mathrm{d} q_i}=-\frac{p_i+\varepsilon _i}{q_i+\varepsilon _i}\,, \end{aligned}$$

and notice that it approaches \(-1\) for large values of \(\varepsilon _i\). Most optimization algorithms use gradient information to minimize a loss function (Fletcher 2000). Therefore it is important to retain the gradient information, which implies that \(\varepsilon _i\) should not be too large.
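The saturation of the gradient for large shifts is easy to see numerically; the snippet below (our own illustration, with arbitrary example vectors) evaluates the gradient formula above for growing \(\varepsilon \).

```python
import numpy as np

def skl_gradient_q(p, q, eps):
    """Elementwise gradient d D_sKL / d q_i = -(p_i + eps_i)/(q_i + eps_i)."""
    return -(p + eps) / (q + eps)

p, q = np.array([0.2, 0.8]), np.array([0.3, 0.7])
for e in (0.0, 0.1, 10.0, 1000.0):
    print(e, skl_gradient_q(p, q, np.full(2, e)))
# As eps grows, both components approach -1, so the gradient carries
# less and less information about the individual entries of q.
```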

Using these guidelines, we provide two particular choices of \(\varvec{\upvarepsilon }\). First, in Sect. 2.3.1, we discuss a natural, static choice of \(\varvec{\upvarepsilon }\) and examine its shortcomings. In Sect. 2.3.2, we then suggest a dynamic choice of \(\varvec{\upvarepsilon }\) and prove in Theorem 3 that the resulting sKL divergence is a good substitute for the KL divergence in the presence of small i.i.d. Gaussian noise. When possible, one should therefore prefer the dynamic choice of \(\varvec{\upvarepsilon }\), as we will also demonstrate in Sect. 3.3.

2.3.1 Static choice

We first suggest a choice of \(\varvec{\upvarepsilon }\) that can be used with any optimization algorithm. This includes higher-order algorithms such as BFGS or conjugate-gradient methods, which make use of approximations of the Hessian matrix of the loss function (Fletcher 2000). For higher-order methods, the objective function must not change during optimization. This requires us to choose a suitable \(\varvec{\upvarepsilon }\) a priori, which can be a difficult task. A simple approach is to evaluate \(\textbf{q}\) once before starting the optimization and to set

$$\begin{aligned} \varepsilon _i=\begin{cases} 0\,, & p_i=0\,, \\ \varepsilon \,, & \text {else}\,, \end{cases} \end{aligned}$$
(3)

where \(\varepsilon \) is fixed to a number slightly larger than \(\max _{p_j > 0} (-q_j)\). If \(\varepsilon \) turns out to be too small during the optimization, i.e., if values \(q_i+\varepsilon \le 0\) are encountered, the optimization has to be stopped early. The best possible result can then be found by tuning \(\varepsilon \).
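As an illustration of Eq. (3), the sketch below constructs the static shift vector from a single evaluation \(\textbf{q}_0\) of the approximate distribution; the safety factor `margin`, used to make \(\varepsilon \) "slightly larger" than the most negative relevant entry, is a hypothetical choice of ours.

```python
import numpy as np

def static_shift(p, q0, margin=1.1):
    """Static choice of eps, Eq. (3): eps_i = eps where p_i > 0, else 0.

    The constant eps is set slightly above max_{p_j > 0}(-q0_j), based on one
    evaluation q0 of the approximate distribution before the optimization.
    If q0 has no negative entries at the relevant indices, this sketch simply
    falls back to eps = 0, i.e., to the plain KL divergence.
    """
    p, q0 = np.asarray(p, dtype=float), np.asarray(q0, dtype=float)
    worst = float(np.max(-q0[p > 0], initial=0.0))
    return np.where(p > 0, margin * worst, 0.0)
```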

This parameter choice allows us to choose a wide variety of optimizers. However, in addition to the requirement of tuning \(\varepsilon \), the static choice leads to an unnecessarily large loss of gradient information. This is because gradient information is reduced for all i, even though \(q_i<0\) is rarely encountered in practice.

2.3.2 Dynamic choice

We now suggest a second choice that can be used only if the objective function is allowed to change in every iteration of the optimizer. This does not pose a problem when using first-order optimization algorithms such as gradient descent (Fletcher 2000) or Adam (Kingma and Ba 2017), as these do not use information about the Hessian matrix of the loss function. In this scenario, we can choose a new \(\varvec{\upvarepsilon }\) for every new \(\textbf{q}\) during the optimization process. Our proposed choice is

$$\begin{aligned} \varepsilon _i=\begin{cases} 0\,, & p_i=0 \text { or } q_i>0\,, \\ |q_i| + f(|q_i|)\,, & \text {else}\,, \end{cases} \end{aligned}$$
(4)

where \(f:\mathbb {R}_{\ge 0} \rightarrow \mathbb {R}_{\ge 0}\) is a nonnegative function. This choice of \(\varvec{\upvarepsilon }\) avoids a large loss in gradient information while extending the domain on which the sKL divergence is defined where necessary. A concrete choice of f, which we use in Sect. 3, is given by \(f(x)=\delta \cdot x\) with a constant \(\delta >0\), but many other choices are possible.
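A corresponding sketch of the dynamic choice, Eq. (4), with the concrete \(f(x)=\delta \cdot x\) mentioned above; the default value of \(\delta \) is a placeholder.

```python
import numpy as np

def dynamic_shift(p, q, delta=0.1):
    """Dynamic choice of eps, Eq. (4), with f(x) = delta * x.

    eps_i = 0 where p_i = 0 or q_i > 0, and eps_i = |q_i| + delta*|q_i|
    otherwise; for q_i < 0 this gives q_i + eps_i = delta*|q_i| > 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    need_shift = (p != 0) & (q <= 0)
    return np.where(need_shift, (1.0 + delta) * np.abs(q), 0.0)
```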

2.4 Negative entries due to Gaussian noise

To further motivate Eq. (4), let us consider a simple example in which negative entries of \(q_i\) are caused by small Gaussian noise. In this case, we show that the difference between the KL divergence and the average of the sKL divergence over the Gaussian noise is quadratic in the standard deviation of the noise, see Theorem 3, the proof of which is provided in Sect. A. To keep the presentation simple, we restrict ourselves to nonnegative functions f that are finite sums of power laws, i.e., \(f(x)=\sum _ka_kx^{b_k}\) with at least one \(a_k\ne 0\).

Theorem 3

Let \(\textbf{p}\) and \(\textbf{q}\) be two probability vectors on a finite set of n elements, with \(q_i>0\) whenever \(p_i>0\). Furthermore, let \(\textbf{x}\) be a vector of i.i.d. Gaussian random variables with mean 0 and standard deviation \(\sigma \). If the \(\varepsilon _i\) are chosen according to Eq. (4) with the restriction on f stated above, the average of the sKL divergence of \(\textbf{p}\) from \(\textbf{q}+\textbf{x}\) is given by

$$\begin{aligned} \big \langle D_{sKL }(\textbf{p}\Vert \textbf{q}+\textbf{x})\big \rangle _{\textbf{x}} = D_{KL }(\textbf{p}\Vert \textbf{q})+\sigma ^2\sum _i\frac{p_i}{2q_i^2}+\mathcal {O}(\sigma ^4)\,. \end{aligned}$$

In fact, the restriction on f can be relaxed significantly so that a much larger class of functions is admissible, see Sect. A.3 for details.

In this example, we therefore see explicitly that the sKL divergence can be used as a substitute divergence measure for the KL divergence, provided that the condition \(\sigma ^2\sum _ip_i/2q_i^2\ll D_{KL }({\textbf {p}}\Vert {\textbf {q}})\) is satisfied.

To validate Theorem 3 numerically, we first consider a generic toy model where \(\textbf{p},\textbf{q}\in \mathbb R^{10}\) are probability vectors whose entries are i.i.d. uniform random variables. As per the assumptions of Theorem 3, we consider i.i.d. Gaussian random variables added to \(\textbf{q}\), and we choose the \(\varepsilon _i\) according to Eq. (4) with the simple choice \(f(x)=\delta \cdot x\). In order for the \(\sigma ^2\) correction to be small, we restrict ourselves to probability vectors \(\textbf{p}\) and \(\textbf{q}\) that satisfy

$$\begin{aligned} \sigma ^2\sum _i\frac{p_i}{2q_i^2}\le \frac{1}{2}D_{KL }({\textbf {p}}\Vert {\textbf {q}}) \end{aligned}$$

for all \(\sigma \) considered. Figure 2 shows the convergence of the sKL divergence to the KL divergence in the limit of small noise strength \(\sigma \). The plot shows averages over 200 pairs of \(\textbf{p}\) and \(\textbf{q}\). We can see that in this case Theorem 3 gives a good prediction for the relative difference between the sKL divergence and the KL divergence. Interestingly, for large values of \(\delta \), the relative difference is significantly lower than predicted. Therefore, large values \(\delta \ge 10^1\) should be preferred in this toy model, but this may be different for other applications, in particular when the negative entries are not due to Gaussian noise. Finally, we observe that the parameter \(\delta \) in the function f has only a small effect on the result when \(\sigma \) is sufficiently small.
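A minimal simulation in this spirit is sketched below (sample sizes, the seed, and the value of \(\delta \) are placeholders, not the settings used for Fig. 2); it compares the noise average of the sKL divergence with the order-\(\sigma ^2\) prediction of Theorem 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def skl(p, q, eps):
    m = (p + eps) > 0
    return float(np.sum((p + eps)[m] * np.log((p + eps)[m] / (q + eps)[m])))

def dynamic_shift(p, q, delta):
    return np.where((p != 0) & (q <= 0), (1.0 + delta) * np.abs(q), 0.0)

n, sigma, delta = 10, 1e-3, 0.1
p = rng.random(n); p /= p.sum()        # toy probability vectors with i.i.d.
q = rng.random(n); q /= q.sum()        # uniform entries, normalized to sum 1

# Average the sKL divergence over many noise realizations ...
samples = [skl(p, q + x, dynamic_shift(p, q + x, delta))
           for x in rng.normal(0.0, sigma, size=(20000, n))]

# ... and compare with the prediction of Theorem 3 truncated at order sigma^2.
predicted = kl(p, q) + sigma**2 * np.sum(p / (2.0 * q**2))
print(np.mean(samples), predicted)
```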

Fig. 2 Relative difference between the KL divergence and the sKL divergence with the dynamic choice of \(\varvec{\upvarepsilon }\) given by Eq. (4) with \(f(x)=\delta \cdot x\), for different noise levels \(\sigma \). For each \(\sigma \), the dashed line shows the prediction from Theorem 3 truncated at order \(\sigma ^2\), the solid line shows the result obtained from numerical integration of Eq. (A1), and the points show the result obtained from simulation

In many applications, the strength of the noise can be controlled by parameters that determine the numerical accuracy of the approximation. If the assumption of i.i.d. Gaussian noise is satisfied in a particular application, Theorem 3 is directly applicable. In other applications, the noise might follow a different distribution, for which one would have to do a similar analysis.

In the concrete application we consider in Sect. 3, we have seen numerically that the assumption of i.i.d. Gaussian noise is not satisfied. Nevertheless, as we increase the numerical accuracy of our approximation, we observe that the sKL divergence with the choice of Eq. (4) converges to the KL divergence computed without approximations. Also, the model results obtained from optimizing the sKL divergence with the choice of Eq. (4) converge to the model result obtained from the KL divergence without approximations.

3 Application

3.1 Mutual Hazard Networks

As a real-world application, we consider the modeling of cancer progression using Mutual Hazard Networks (MHNs) (Schill et al. 2019). Cancer progresses by accumulating genetic events such as mutations or copy-number aberrations (Michor et al. 2004). As each event can be either absent or present, the state of an event is represented by 0 (absent) or 1 (present). We consider d such events, thus representing a tumor as a d-dimensional vector \(x\in \{0,1\}^d\). An MHN aims to infer promoting and inhibiting influences between single events. The data used for the learning process are patient data of observed tumors. The data distribution \(\textbf{p}\) is thus a discrete probability distribution on the \(2^d\)-dimensional state space \(\mathcal {S}=\left\{ x\in \{0,1\}^d\right\} \) of possible tumors, i.e., \(n=2^d\) in the notation of Sect. 2.

An MHN models the progression as a continuous-time Markov chain under the assumptions that at time \(t=0\) no tumor has active events, that events occur only one at a time, that events are not reversible, and that transition rates follow the Proportional Hazard Assumption (Cox 1972). If we have two states \(x, x^{+i}\in \mathcal {S}\) that differ only by \(x_i=0\) and \(x^{+i}_i=1\), the transition rate from state x to state \(x^{+i}\) is modeled as

$$\begin{aligned} \textbf{Q}_{x^{+i}, x} = e^{\theta _{ii}}\prod _{x_j=1}e^{\theta _{ij}}\,, \end{aligned}$$

where \(e^{\theta _{ii}}\) is the base rate of event i and \(e^{\theta _{ij}}\) is the multiplicative influence event j has on event i. An MHN with d events can thus be described by a parameter matrix \(\theta \in \mathbb {R}^{d\times d}\). Figure 3 shows all allowed transitions and their rates for \(d=3\). The transition-rate matrix can efficiently be written as a sum of d Kronecker products,

$$\begin{aligned} \textbf{Q}=\sum _{i=1}^d\bigotimes _{j=1}^dQ_{ij} \end{aligned}$$
(5)

with

$$\begin{aligned} Q_{ij}=\begin{pmatrix}1&0\\ 0&e^{\theta _{ij}}\end{pmatrix}\quad \text {for } i\ne j \quad \text {and}\quad Q_{ii}=\begin{pmatrix}-e^{\theta _{ii}}&0\\ e^{\theta _{ii}}&0\end{pmatrix}. \end{aligned}$$
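For small d, Eq. (5) can be evaluated directly as a dense matrix; the sketch below (our own, purely illustrative, with one particular Kronecker/bit-ordering convention) makes the structure explicit.

```python
import numpy as np
from functools import reduce

def build_Q(theta):
    """Dense 2^d x 2^d transition-rate matrix of an MHN, Eq. (5).

    Only feasible for small d; the TT construction described in the text
    avoids ever forming this matrix explicitly.
    """
    d = theta.shape[0]
    Q = np.zeros((2**d, 2**d))
    for i in range(d):
        factors = []
        for j in range(d):
            if i == j:
                factors.append(np.array([[-np.exp(theta[i, i]), 0.0],
                                         [ np.exp(theta[i, i]), 0.0]]))
            else:
                factors.append(np.array([[1.0, 0.0],
                                         [0.0, np.exp(theta[i, j])]]))
        Q += reduce(np.kron, factors)   # Kronecker product over j = 1, ..., d
    return Q
```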
Fig. 3 Allowed transitions, along with transition rates, for an MHN with three possible events and parameter matrix \(\theta \in \mathbb {R}^{3 \times 3}\)

Starting from the initial distribution \(\textbf{q}_\varnothing =(1,0,0,\ldots )\in \mathbb {R}^{2^d}\), the probability distribution at time \(t \ge 0\) is given by Markov-chain theory as

$$\begin{aligned} \textbf{q}(t)=e^{t\textbf{Q}}\textbf{q}_\varnothing \,, \end{aligned}$$

where \(e^{t\textbf{Q}}\) denotes the matrix exponential. Since tumor age is not known in the data, MHNs assume that the age of tumors in \(\textbf{p}\) is an exponentially distributed random variable with mean 1. If we marginalize \(\textbf{q}(t)\) over t accordingly, we obtain

$$\begin{aligned} \textbf{q}_\theta :=\int _0^\infty \text {d}t\;e^{-t}e^{t\textbf{Q}}\textbf{q}_\varnothing =\left( \textbf{I}-\textbf{Q}\right) ^{-1}\textbf{q}_\varnothing \,, \end{aligned}$$
(6)

where \(\textbf{I}\) denotes the identity.
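For small d, Eq. (6) can also be solved directly without the TT format; a minimal sketch, reusing the `build_Q` helper sketched after Eq. (5) (the function names are ours):

```python
import numpy as np

def marginal_distribution(theta):
    """Time-marginalized distribution q_theta = (I - Q)^{-1} q_0, Eq. (6)."""
    d = theta.shape[0]
    Q = build_Q(theta)                  # dense rate matrix from Eq. (5)
    q0 = np.zeros(2**d)
    q0[0] = 1.0                         # at t = 0 no events are active
    return np.linalg.solve(np.eye(2**d) - Q, q0)
```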

A parameter matrix \(\theta \) that best fits the data distribution \(\textbf{p}\) can now be obtained by minimizing a divergence measure of \(\textbf{p}\) from \(\textbf{q}_\theta \). For small d (i.e., \(d\lesssim 25\)), the KL divergence can be used since the calculation of \(\textbf{q}_\theta \) can be done without approximation. For larger d, this is no longer possible because of the exponential complexity (recall that the dimension of the state space is \(2^d\)), and the approach has to be modified (Georg et al. 2022). One possible modification is the use of low-rank tensor methods (Grasedyck et al. 2013) in order to keep calculations tractable. In particular, we use the tensor-train (TT) format (Oseledets 2011), in which the approximation error is typically controlled in the Euclidean norm. Specifically, when approximating a tensor \(\textbf{a}\) by \(\tilde{\textbf{a}}\), a maximum Euclidean distance \(\Delta \ge ||\textbf{a}-\tilde{\textbf{a}}||_2\) can be specified. Thus, small negative entries can occur in \(\tilde{\textbf{a}}\) if the corresponding entry in \(\textbf{a}\) is smaller than \(\Delta \). In this case, the KL divergence is no longer defined. To be able to perform the optimization, we switch to the sKL divergence as our objective function.

In Sect. 3.2, we explain the basics of the TT format. Sect. 3.3 shows how we can find suitable \(\theta \) matrices by use of the sKL divergence even when negative entries are encountered.

3.2 Tensor Trains

For a large number d of events, storing \(\textbf{q}\) (and \(\textbf{p}\)) as a \(2^d\)-dimensional vector is computationally infeasible. This storage requirement can be alleviated through use of the TT format (Oseledets 2011). A d-dimensional tensor \(\textbf{a}\in \mathbb {R}^{n_1\times \cdots \times n_d}\) with mode sizes \(n_k\in \mathbb {N}\) can be approximated in the TT format as

$$\begin{aligned} \textbf{a}(i_1, \ldots , i_d) \approx \sum _{\alpha _0,\ldots ,\alpha _d} \prod _{k=1}^{d} \textbf{a}^{(k)}(\alpha _{k-1}, i_k, \alpha _k) \end{aligned}$$

for all \(i_1, \dots , i_d\) with tensor-train cores \(\textbf{a}^{(k)}\in \mathbb {R}^{r_{k-1}\times n_k\times r_k}\), where the \(r_k\) are tensor-train ranks. In order to represent a scalar by the right-hand side, the condition \(r_0=r_d=1\) is required. The quality of the approximation can be controlled through the TT ranks \(r_k\). In particular, it can be shown that choosing the TT ranks large enough gives an exact representation of \(\textbf{a}\) (Oseledets 2011).
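To illustrate the format, the sketch below evaluates a single entry of a tensor given by its TT cores, i.e., it carries out the contraction over the rank indices \(\alpha _k\) in the formula above (the helper name and core layout are our own conventions).

```python
import numpy as np

def tt_entry(cores, idx):
    """Evaluate a(i_1, ..., i_d) for a tensor given by its TT cores.

    `cores` is a list of arrays of shape (r_{k-1}, n_k, r_k) with
    r_0 = r_d = 1; `idx` is the multi-index (i_1, ..., i_d).
    """
    v = np.ones((1, 1))
    for core, i in zip(cores, idx):
        v = v @ core[:, i, :]      # contract over the rank index alpha_k
    return float(v[0, 0])
```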

The TT format not only allows for efficient storage of high-dimensional tensors, but also supports many basic operations, e.g., addition, inner products, or operator-by-tensor products (Oseledets 2011). Furthermore, there are efficient algorithms for solving linear equations (Holtz et al. 2012; Dolgov and Savostyanov 2014).

By modifying the shape of the objects in Eq. (5) and changing the Kronecker products to tensor products, the transition-rate matrix \(\textbf{Q}\in \mathbb {R}^{\mathcal {S}\times \mathcal {S}}\) can be written in the tensor-train format (Hackbusch 2019) (for details see Sect. B). This leads to a tensor train with all TT ranks \(r_1,\ldots ,r_{d-1}\) equal to d. Thus, we can approximately solve Eq. (6) in the TT format using the algorithm of Holtz et al. (2012). Similar techniques can be used for the gradient calculation (Georg 2022).

Details of the algorithms involved (Holtz et al. 2012) lead to the conditions \(r_{k+1}\le n_{k+1}r_k\) and \(r_{k-1}\le n_kr_k\) on the TT ranks of \(\textbf{q}_\theta \). As a result, the first and last TT ranks are \(r_1\le n_1\) and \(r_{d-1}\le n_d\). Towards the middle of the tensor train, ranks increase until they level off at the specified maximum rank. In our simulations, we specify a maximum TT rank \(r_{\textbf{q}}\) and choose the TT ranks \(r_k\le r_{\textbf{q}}\) to be as large as possible given the constraints on the \(r_k\).

3.3 Simulations

We test how well an MHN can learn a probability distribution \(\textbf{p}\) when optimizing the sKL divergence instead of the classical KL divergence. We use simulated data for \(d=20\) events. This relatively small value of d allows for exact calculations so that we can compare results obtained with and without the use of tensor trains. In the following, we describe how the data were generated, how the MHNs were learned, and how their quality was assessed.

Every dataset was generated from a ground-truth model described by \(\theta _{\text {GT}}\in \mathbb {R}^{d\times d}\). The diagonal entries of \(\theta _{\text {GT}}\) were drawn from a Gaussian distribution with density \(\exp (-(x-\mu )^2/2\sigma ^2)/(\sigma \sqrt{2\pi })\), where \(\mu =-1\) and \(\sigma =1\). A random set of \(10\%\) of the off-diagonal entries of \(\theta _{\text {GT}}\) was drawn from a Laplace distribution with density \(\exp (-|x-\mu |/b)/(2b)\), where \(\mu =0\) and \(b=1\), and the remaining entries were set to 0. These choices were made to mimic MHNs obtained from real data we studied. The data distribution \(\textbf{p}\) was obtained from \(\theta _{\text {GT}}\) by drawing 1000 samples from its time-marginalized probability distribution, as defined in Eq. (6).
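A sketch of this data-generating step is given below; the function name and the random seed are ours, while the distributions are the ones stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def ground_truth_theta(d=20, frac=0.1):
    """Ground-truth parameter matrix theta_GT as described in the text.

    Diagonal entries: Gaussian with mu = -1, sigma = 1.  A random 10% of the
    off-diagonal entries: Laplace with mu = 0, b = 1; all others are zero.
    """
    theta = np.zeros((d, d))
    theta[np.diag_indices(d)] = rng.normal(-1.0, 1.0, size=d)
    off = [(i, j) for i in range(d) for j in range(d) if i != j]
    for k in rng.choice(len(off), size=int(frac * len(off)), replace=False):
        theta[off[k]] = rng.laplace(0.0, 1.0)
    return theta
```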

Given a dataset, MHNs were learned by optimizing the sKL divergence. We indirectly control the magnitude of the largest negative entries of \(\textbf{q}_\theta \) through our choice of the maximum possible TT rank \(r_\textbf{q}\) of \(\textbf{q}_\theta \). Ten datasets were generated, and for each of them, MHNs were learned for specific choices of \(r_\textbf{q}\) and \(\varvec{\upvarepsilon }\). The results shown below are arithmetic means over these 10 runs. To avoid overfitting, an L1 penalty term of the form \(\lambda \sum _{i\ne j}|\theta _{ij}|\) was added to the objective function. The factor \(\lambda \) was not made part of the optimization but set to a constant value of \(10^{-3}\) for all simulation runs to make the comparison of different models easier.
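Schematically, the resulting objective for the exact (non-TT) case can be written as follows, reusing the `marginal_distribution`, `dynamic_shift`, and `skl` sketches from above; in the TT runs, \(\textbf{q}_\theta \) is instead computed approximately and may contain small negative entries.

```python
import numpy as np

def objective(theta, p, lam=1e-3, delta=0.1):
    """sKL loss with an L1 penalty on the off-diagonal entries of theta."""
    q = marginal_distribution(theta)      # Eq. (6), exact for small d
    eps = dynamic_shift(p, q, delta)      # Eq. (4) with f(x) = delta * x
    l1 = lam * (np.sum(np.abs(theta)) - np.sum(np.abs(np.diag(theta))))
    return skl(p, q, eps) + l1
```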

A learned MHN’s quality was assessed using the KL divergence of the ground-truth model’s time-marginalized probability distribution from the time-marginalized probability distribution of the learned MHN, see Eq. (6). This KL divergence was calculated without the use of the TT format to ensure that no negative entries can occur.

3.3.1 Static choice

First, we consider the static choice of \(\varvec{\upvarepsilon }\) given in Eq. (3). In this case, higher-order optimizers can be used for faster convergence to an optimum. However, for a fair comparison with the dynamic choice, we used gradient descent (a first-order optimizer) for all optimizations. Table 1 and Fig. 4 show the results for various combinations of \(r_\textbf{q}\) and \(\varepsilon \). In the last column (“exact”), optimization using the sKL divergence was also done when the TT format was not used, even though the KL divergence is well-defined and the introduction of a positive \(\varepsilon \) is not necessary in this case. The additional entries are written in grey to indicate this.

Table 1 Average KL divergence (without approximation) from the ground-truth model \(\theta _{\text {GT}}\) for MHNs obtained by optimizing the sKL divergence with the static choice of \(\varvec{\upvarepsilon }\) given by Eq. (3)

For increasing values of \(r_\textbf{q}\), the TT approximation improves, and accordingly the approximate results tend towards the exact result, although the convergence is quite slow. If \(q_{\theta , i}+\varepsilon <0\) occurred in any iteration of the optimization procedure, the optimization was stopped and the last \(\theta \) matrix was returned. The colors indicate the percentage of runs where this happened.

The first row of Table 1 shows the results of optimization using the KL divergence. It can clearly be seen that, when the TT format is used, optimization using the sKL divergence leads to MHNs that describe the data more closely than those obtained with the KL divergence. The optimal \(\varepsilon \) value is between \(10^{-7}\) and \(10^{-6}\) across all cases where the TT format is used. The results also clearly show that, as explained in Sect. 2.3, if \(\varepsilon \) is chosen too small or too large, we obtain poor optimization results. From the colors we can see that early stopping due to negative values of \(q_{\theta ,i}+\varepsilon \) does not have a big influence on the optimization results. This is because early stopping often occurred late in the optimization procedure, i.e., after a good estimate for the optimum was already found.

Fig. 4 Visualization of the data in Table 1. The dashed line indicates the result when optimizing the KL divergence without the use of the TT format

3.3.2 Dynamic choice

Next, we similarly investigate the dynamic choice of \(\varvec{\upvarepsilon }\) given in Eq. (4), choosing the function f as

$$\begin{aligned} f(x)=\delta \cdot x \end{aligned}$$
(7)

with \(\delta >0\). In Theorem 3 we showed that the average of the sKL divergence converges to the KL divergence in the limit of small i.i.d. Gaussian noise. In our particular application, numerical data show that the assumption of i.i.d. Gaussian noise is violated. Nevertheless, as \(r_{\textbf{q}}\) increases and thus the quality of the approximation improves, we observe in Fig. 5 that the sKL divergence converges to the KL divergence, as already mentioned at the end of Sect. 2.3.2. This statement is true for all values of the parameter \(\delta \), although the details of the convergence may depend on \(\delta \).

Fig. 5 Relative difference between the KL divergence and the sKL divergence with the dynamic choice of \(\varvec{\upvarepsilon }\) given by Eqs. (4) and (7), for different TT ranks \(r_{\textbf{q}}\)

We now discuss the results obtained from optimizing the sKL divergence, similar to Sect. 3.3.1. Since optimization using higher-order optimizers is not possible with the dynamic choice, simple gradient descent was used for optimization. The dynamic choice of \(\varvec{\upvarepsilon }\) ensures \(q_{\theta , i}+\varepsilon _i>0\), so optimization was always done until a stopping criterion was satisfied. In Table 2 and Fig. 6 we show numerical results for various combinations of \(\delta \) and \(r_\textbf{q}\). As \(r_\textbf{q}\) is increased, we again observe a convergence towards the exact result, but now at a much faster rate than for the static choice. Table 2 also shows that the results are quite stable with respect to the parameter \(\delta \). For the exact calculation without the TT format, \(\delta \) played no role, since it is only important when encountering negative entries in \(\textbf{q}_\theta \).

Table 2 Average KL divergence (without approximation) from the ground-truth model \(\theta _{\text {GT}}\) for MHNs obtained by optimizing the sKL divergence with the dynamic choice of \(\varvec{\upvarepsilon }\) given by Eqs. (4) and (7)

Comparing the static and dynamic choice of \(\varvec{\upvarepsilon }\), it can be seen clearly that dynamically choosing \(\varvec{\upvarepsilon }\) generally leads to MHNs that are closer to the exact results at the same level of approximation. This is because usually only a few entries of \(\textbf{q}_{\theta }\) are negative. Therefore, dynamically choosing \(\varvec{\upvarepsilon }\) leads to an objective function that closely resembles the KL divergence, while the static choice introduces a shift for all entries even when the shift is not needed.

Fig. 6 Visualization of the data in Table 2

4 Summary and Outlook

We have introduced a new method to handle negative entries in approximate probability vectors. This method is very general and can be used in a wide variety of applications (see Sect. 1.2 for examples). Moreover, it does not come with significant computational overhead.

We showed that the sKL divergence shares many desirable properties with the KL divergence. We discussed two possible choices of the shift parameters that occur in the sKL divergence. The static choice allows for the use of higher-order optimizers, but it requires tuning of the parameters and leads to a large loss of gradient information. In contrast, the dynamic choice restricts us to first-order optimizers, but it offers more freedom in the choice of the parameters and preserves most of the gradient information. For the dynamic choice we showed that, when negative entries occur due to i.i.d. Gaussian noise, the difference between the KL divergence and the average sKL divergence is quadratic in the strength of the noise and thus goes to zero for small noise.

In this work, we only considered the sKL divergence in the context of approximate discrete probability distributions. The investigation of possible use cases in other contexts is left for future work.

We applied our method to a real-world application, the modeling of cancer progression by Mutual Hazard Networks, where the use of tensor-train approximations can lead to negative entries. We showed that the sKL divergence and the corresponding model results converge to the KL divergence and the exact model results, respectively, when the parameter that controls the quality of the approximation is increased. We also showed that the dynamic choice of the shift parameters leads to a faster convergence to the exact results than the static choice, as expected from the theoretical considerations in Sect. 2.

When using the sKL divergence as an objective function, first-order optimizers are desirable because they allow for more freedom when choosing the parameters of the sKL divergence. So far, we only used a standard gradient-descent optimizer. In future work, we will investigate the effect of different first-order optimizers, including stochastic and momentum-based optimizers.

Regarding MHNs, it was shown in Georg (2022) that the computational complexity of model construction can be reduced from exponential to cubic in the number of events using low-rank tensor formats. The remaining problem of negative entries in the approximate probability vectors is solved by the current work, which thus enables the construction of large Mutual Hazard Networks with \(\gg 30\) events. Expected applications will include as many as 800 events. Furthermore, modifications of the MHN have been conceived which allow for more realistic modeling of tumor progression, but which are currently limited to a very small number of events. We will apply the techniques described in this work to these large and extended MHNs in future work.