1 Introduction

Phase-type (PH) distributions have been employed extensively in applied probability, since they often provide exact and explicit solutions to complex stochastic problems. Another attractive property of PH distributions is that they form a dense class, in the sense of weak convergence, within the set of distributions on the positive half-line (see Section 3.2.1 of Bladt and Nielsen (2017)). Despite their denseness, however, PH distributions are always light-tailed, which may be a problem when heavy tails are present.

At least three approaches to remedy this problem have been introduced in the literature. The first one, originally introduced in Bladt et al. (2015) and called the NPH class of distributions, consists of considering PH distributions scaled by nonnegative discrete random variables N. This construction principle has the advantage that the resulting distribution retains its interpretation as the absorption time of a homogeneous Markov jump process, albeit on an infinite-dimensional state-space. This, indeed, allows for genuinely heavy tails for the resulting distribution. For instance, in Rojas-Nandayapa and Xie (2018), the authors showed that if the scaling component is unbounded (but otherwise arbitrary), then the resulting distribution is always heavy-tailed in the sense of a non-existent moment generating function (see also Su and Chen (2006) for more general results). However, the resulting functionals are expressed in terms of infinite-dimensional matrices, which, in practice, can only be computed up to a finite number of terms. More recently, in Albrecher et al. (2021a), the authors considered continuous scaling and showed that closed-form expressions for various functionals of the resulting distributions can be obtained. They denoted this class by CPH. Another advantage of continuous scaling is that it preserves the (finite) dimension of the underlying PH distribution.

A second approach was introduced in Albrecher et al. (2020a) by considering a time-fractional version of the underlying stochastic process dynamics, effectively moving into the semi-Markov domain. Together with subsequent multivariate extensions based on rewards (cf. Albrecher et al. (2020b, 2021)), these models were shown to be feasible for applications such as non-life insurance modeling. More recently, Bladt (2021) showed that these models are relevant for describing lifetimes and performing the corresponding life-insurance calculations.

The third approach, introduced in Albrecher and Bladt (2019), consists of allowing the Markov jump process to be time-inhomogeneous in the construction principle of PH distributions leading to the class of inhomogeneous phase-type (IPH) distributions. An advantage of this approach is that one gains substantial flexibility on the tails: not only are heavy tails possible but also, e.g., lighter tails than exponential-decay can be obtained. Further extensions to covariate-dependent distributions can be found in Albrecher et al. (2021b), which is particularly well-suited for survival analysis applications.

Estimation of PH distributions was initially developed to calibrate such stochastic models to real-life data, and it is a well-developed topic in the literature. It is typically done via an expectation-maximization (EM) algorithm (Asmussen et al. (1996)), although other methods, such as an MCMC approach, have been introduced (Bladt et al. (2003)). More recent trends have moved towards considering PH-based models purely as flexible models for statistical fitting, irrespective of their explicit and closed-form formulas. This data-driven approach is particularly attractive compared to other classical alternatives (for instance, kernel smoothing), since it carries the implicit interpretation of an underlying process traversing different states before termination, which is easy to justify in many application areas. Algorithms for discretely-scaled PH distributions, IPH models, and continuously-scaled PH distributions can be found, respectively, in Bladt and Rojas-Nandayapa (2018), Albrecher et al. (2020), and Albrecher et al. (2021a). To the best of the authors’ knowledge, an EM-based estimation procedure for fractional phase-type distributions (also called matrix Mittag-Leffler distributions) has not been considered before the present work, with Albrecher et al. (2020a) performing a purely numerical multi-dimensional maximum-likelihood estimation.

The primary purpose of this paper is to present a unified theory that encompasses the above approaches to produce heavy-tailed phase-type distributions. The construction principle of the proposed models is simple to conceptualize and can be seen as a matrix extension of the frailty model in survival analysis. However, the flexibility of the underlying Markov structure allows for very different objects to be constructed as special cases. More precisely, we study IPH distributions with intensity matrices scaled by any nonnegative random variable. In other words, we impose both a random and a deterministic component which modify the speed at which the finite state-space is traversed by the Markov process, such that absorption times can possess any desired tail and body behavior, in particular yielding heavy-tailed distributions. Inhomogeneous generalizations of Albrecher et al. (2021a); Rojas-Nandayapa and Xie (2018), the matrix Mittag-Leffler models of Albrecher et al. (2020a), and randomly scaled generalizations of Albrecher and Bladt (2019); Albrecher et al. (2021b) (with the possibility of missing covariates) are all contained in this rich class.

In terms of physical interpretation, the latent variables play different roles. The underlying Markov dynamics aim to model heterogeneity by assuming that unobserved traversing of states has occurred. In contrast, the interpretation of the scaling component is closely related to the statistical concept of frailty. Recall that frailty models (see, e.g., Wienke (2010) for a comprehensive account of such models) specify a multiplicative random effect on the hazard rate of a distribution, effectively accounting for unobserved covariates in a Cox proportional hazards model. In contrast, we specify a multiplicative random effect on the intensity function of a Markov jump process. Nonetheless, since for IPH distributions, the hazard rate and intensity function are asymptotically equivalent (cf. Albrecher et al. (2021b)), the scaling variable can also be interpreted as accounting for heterogeneity or missing covariates in an asymptotically proportional hazards model.

The secondary aim of the paper is to present multivariate models based on this construction, which can be interpreted as generalizations of the shared and correlated frailty models (cf. Wienke (2010)). We derive EM algorithms for maximum-likelihood estimation of all the proposed models, which can be implemented either in full generality or by simplifying some assumptions and tailoring the methods to the specific application. For pedagogical reasons, we build up the multivariate case from the univariate one, although a top-down approach is also possible.

The rest of the paper is organized as follows. In Section 2, we present an overview of the class of IPH distributions and some important properties for our present purposes. In Section 3, we introduce our main univariate model, which we call scaled inhomogeneous phase-type, derive its main properties, give several parametric examples relevant for real-life applications, and propose a generalized EM algorithm for its maximum-likelihood estimation. In Section 4, we present a multivariate extension inspired by the shared frailty model and show how estimation of the proposed models can be performed via EM algorithms. In Section 5, we present a different multivariate extension, now based on the construction principle of correlated frailty models, and derive an EM algorithm for maximum-likelihood estimation. In Section 6, we present several numerical illustrations. Finally, Section 7 concludes.

2 Preliminaries

This section presents the relevant preliminaries on time-inhomogeneous Markov jump processes and their absorption times. The distributions of the latter are the building blocks for the scaled models introduced in Section 3. For distributional equality between two random variables X and Y, we use the notation \(X{\mathop {=}\limits ^{d}}Y\), while the notation \(X\sim F\), for F a distribution function, density, or acronym, is understood as X following the distribution uniquely associated with F. Unless stated otherwise, equalities between random objects hold almost surely. For two real-valued functions g and h, the notation \(g(t)\sim h(t)\), as \(t\rightarrow a\in \mathbb {R}\cup \{-\infty ,+\infty \}\), is defined as \(\lim _{t\rightarrow a}g(t)/h(t)=1\). If a is not explicitly mentioned, it is assumed to be \(+\infty\).

Let \(( X_t )_{t \ge 0}\) denote a time-inhomogeneous Markov jump process on the state-space \(E = \{1, \dots , p, p+1\}\), where states \(1,\dots ,p\) are transient and state \(p+1\) is absorbing. In this way, \(( X_t)_{t \ge 0}\) has an intensity matrix of the form

$$\begin{aligned} \varvec{\Lambda }(t)= \left( \begin{array}{cc} \varvec{T}(t) & \mathbf{t} (t) \\ {\mathbf{0}} & 0 \end{array} \right) \,, \quad t\ge 0\,. \end{aligned}$$

Since \(\varvec{\Lambda }(t)\) is an intensity matrix, its rows sum to zero for any time \(t\ge 0\), and so the identity \(\mathbf{t} (t)=- \varvec{T}(t) \, \mathbf{e}\) holds, where \(\mathbf{e}\) is the p–dimensional column vector of ones. Moreover, the probability transition matrix \(\varvec{P}(s, t) = \{p_{k,l}(s,t)\}_{k,l \in E}\) of \(( X_t )_{t \ge 0}\), where

$$\begin{aligned} p_{k,l}(s,t) = \mathbb {P}(X_{t}=l \mid X_{s}=k) \,, \quad k,l \in E \,, \end{aligned}$$

is given in terms of the product integral (see Albrecher and Bladt (2019))

$$\begin{aligned} \varvec{P} (s, t) = \prod _s^t\left( \mathbf{I} + \varvec{\Lambda } (u) du \right) =\begin{pmatrix} \prod _s^t\left( \mathbf{I} + \varvec{T}(u) du \right) & \mathbf{e} -\prod _s^t\left( \mathbf{I} + \varvec{T}(u) du \right) \mathbf{e} \\ \mathbf{0} & 1 \end{pmatrix} \,. \end{aligned}$$

To avoid degeneracies, we assume that the process starts almost surely in a non-absorbing state \(k\le p\) with probabilities given by \(\pi _{k} = {\mathbb P}(X_0 = k)\), \(k = 1,\dots , p\). In vector notation, we write \(\varvec{\pi }= (\pi _1 ,\dots ,\pi _p )\). In the sequel, we follow the convention that Greek boldface lowercase letters denote row vectors, while Roman boldface lowercase letters denote column vectors. Thus \(\sum _{k=1}^p\pi _k=\varvec{\pi }\mathbf{e} =1\).

The main quantity of interest of such a process for our present purposes is the time taken to reach the absorbing state, denoted by

$$\begin{aligned} \tau = \inf \{ t \ge 0 \mid X_t = p+1 \}\,, \end{aligned}$$

which has an inhomogeneous phase-type distribution (cf. Albrecher and Bladt (2019)) with representation \((\varvec{\pi },\varvec{T}(t))\), and we write \(\tau \sim \text{ IPH }(\varvec{\pi },\varvec{T}(t) )\). In statistical applications, such random variables are often treated for the special case \(\varvec{T}(t) = \lambda (t)\,\varvec{T}\), where \(\lambda (t)\) is a nonnegative real function, called the intensity function, and \(\varvec{T}\) is a fixed sub-intensity matrix. We adopt this approach in the present text. Thus we may simply write \(\tau \sim \text{ IPH }(\varvec{\pi }, \varvec{T}, \lambda )\). The interested reader is referred to Bladt and Nielsen (2017) for a comprehensive account of the \(\lambda \equiv 1\) case and to Albrecher and Bladt (2019) for further reading on general IPH distributions.

The restricted class of IPH distributions is nonetheless quite versatile. Whenever \(Y \sim \text{ IPH }(\varvec{\pi }, \varvec{T}, \lambda )\), then there exists a function h such that

$$\begin{aligned} Y {\mathop {\;=\;}\limits ^{d}}h(Z) \,, \end{aligned}$$
(1)

where \(Z \sim \text{ PH }(\varvec{\pi }, \varvec{T})\). More specifically, the relationship between h and \(\lambda\) is given by

$$\begin{aligned} h^{-1}(t) = \int _0^t \lambda (s)ds, \quad t\ge 0, \end{aligned}$$

or in terms of derivatives

$$\begin{aligned} \lambda (t) = \frac{d}{dt}h^{-1}(t) \,. \end{aligned}$$

To make sure that Y is positive, unbounded, and almost surely finite, we require that

$$h^{-1}(t)<\infty \,,\quad \forall t>0\,,\quad \lim _{t\uparrow \infty }h^{-1}(t)=\infty \,.$$
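For instance, the matrix-Weibull intensity of Table 1 gives a worked special case of this correspondence:

$$\begin{aligned} \lambda (t) = \eta t^{\eta - 1}, \quad \eta>0 \qquad \Longrightarrow \qquad h^{-1}(t) = \int _0^t \eta s^{\eta -1}\, ds = t^{\eta }, \qquad h(z) = z^{1/\eta } \,, \end{aligned}$$

so that \(Y {\mathop {=}\limits ^{d}}Z^{1/\eta }\) in (1), with \(Z \sim \text{ PH }(\varvec{\pi }, \varvec{T})\).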

The density \(f_Y\) and survival function \(S_Y\) of \(Y \sim \text{ IPH }(\varvec{\pi }, \varvec{T}, \lambda )\) are explicit in terms of matrix exponential formulas, and given by

$$\begin{aligned} f_Y(y)&= \lambda (y)\, \varvec{\pi }\exp \left( \int _0^y \lambda (t)dt\ \varvec{T} \right) \mathbf{t} , \quad y\ge 0, \\ S_Y(y)&= \varvec{\pi }\exp \left( \int _0^y \lambda (t)dt\ \varvec{T} \right) \mathbf{e} , \quad y\ge 0. \end{aligned}$$
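As a numerical illustration, the following sketch evaluates these two formulas for the matrix-Weibull intensity \(\lambda (t)=\eta t^{\eta -1}\), for which \(\int _0^y \lambda (t)dt = y^\eta\); the parameter values \(\varvec{\pi }\), \(\varvec{T}\), \(\eta\) are illustrative choices, not taken from the paper.

```python
# A minimal numerical sketch (illustrative pi, T, eta -- not values from the
# paper) of the IPH density and survival function for the matrix-Weibull
# intensity lambda(t) = eta * t**(eta-1), so that int_0^y lambda(t) dt = y**eta.
import numpy as np
from scipy.linalg import expm

pi = np.array([0.5, 0.5])              # initial distribution (row vector)
T = np.array([[-2.0, 1.0],
              [0.0, -3.0]])            # sub-intensity matrix
t_vec = -T @ np.ones(2)                # exit rate vector t = -T e
eta = 1.5                              # inhomogeneity parameter

def survival(y):
    # S_Y(y) = pi exp(y^eta T) e
    return pi @ expm(y**eta * T) @ np.ones(2)

def density(y):
    # f_Y(y) = lambda(y) pi exp(y^eta T) t
    return eta * y**(eta - 1.0) * (pi @ expm(y**eta * T) @ t_vec)

print(survival(0.0))  # 1.0: the process starts in a transient state
```

Consistency can be checked by comparing the density against a numerical derivative of the survival function, since \(f_Y=-S_Y'\).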

The tail behavior of IPH distributions is driven by the asymptotic behavior of the \(\lambda\) function. Table 1 presents an overview of some commonly used intensities and transforms for generating parametric IPH distributions (see Bladt and Yslas (2021)). Applications and estimation can be found, for instance, in Albrecher and Bladt (2019); Albrecher et al. (2020, 2021b). The names are inspired by the \(p=1\) case; e.g., a matrix-Weibull distribution reduces to the regular Weibull distribution when \(\varvec{T}\) is a \(1\times 1\) matrix. In general, the additional parameters allow for more flexible modeling in the body of the distribution while preserving the same tail behavior as in the scalar case.

Table 1 Some IPH distributions with their respective intensities and transforms

3 Scaled inhomogeneous phase-type distributions

In this section, we introduce the main general specification of the paper and then derive some special cases together with a detailed analysis of their specific tail asymptotics. The central assumption underpinning our model is that an individual’s intensity function depends on an unobservable nonnegative random variable \(\Theta\). More specifically, we focus on the case where \(\Theta\) acts multiplicatively on the intensity function, that is

$$\begin{aligned} \lambda (t ; \Theta ) = \Theta \lambda (t), \quad t\ge 0, \end{aligned}$$
(2)

where \(\lambda\) is the baseline intensity function. If we denote by Y a random variable with intensity (2), then we have that

$$\begin{aligned} Y \mid \Theta = \theta \sim \text{ IPH } (\varvec{\pi }, \varvec{T}, \theta \lambda ) \,. \end{aligned}$$
(3)

For the representation of these distributions, we make use of functional calculus. More specifically, if g is an analytic function and \(\varvec{A}\) is a matrix, we can express \(g(\varvec{A})\) by Cauchy’s formula

$$\begin{aligned} g(\varvec{A}) = \frac{1}{2 \pi i} \oint _{\Gamma } g(z)(z\mathbf{I} - \varvec{A})^{-1} dz \,, \end{aligned}$$

where \(\Gamma\) is a simple closed path in \(\mathbb {C}\) enclosing the eigenvalues of \(\varvec{A}\) (cf. Section 3.4 of Bladt and Nielsen (2017) for details).

The following result characterizes the density and survival functions of Y. In particular, observe that the asymptotic behavior of the tail of Y depends both on the shape of \(\mathcal {L}_\Theta\), the Laplace transform of \(\Theta\), and on \(\lambda\). In Subsection 3.1, we give an in-depth asymptotic analysis of the new parametric models presented in this paper.

Proposition 3.1

Let Y be given by (3). Then we have that, for \(y\ge 0\),

  1. \(S_{Y}(y) = \varvec{\pi }\mathcal {L}_\Theta ( - h^{-1}(y) \varvec{T}) \mathbf{e}\),

  2. \(f_Y(y) = -\lambda (y) \varvec{\pi }\mathcal {L}_\Theta ^{\prime } ( - h^{-1}(y) \varvec{T}) \mathbf{t}\),

where \(h^{-1}(y)= \int _{0}^{y}\lambda (t)dt\).

Proof

Property (1) follows from

$$\begin{aligned} S_{Y}(y)&= \int \varvec{\pi }\exp ({ \theta h^{-1}(y) \varvec{T}}) \mathbf{e} \, dF_\Theta (\theta ) \\&= \varvec{\pi }\int \exp ({ \theta h^{-1}(y) \varvec{T}}) \, dF_\Theta (\theta ) \mathbf{e} \\&=\varvec{\pi }\int \frac{1}{2 \pi i} \oint _{\Gamma } \exp (z)(z\mathbf{I} - \theta h^{-1}(y) \varvec{T})^{-1} dz \, dF_\Theta (\theta ) \mathbf{e} \\&=\varvec{\pi }\frac{1}{2 \pi i} \oint _{\Gamma }\int \exp (z)(z\mathbf{I} - \theta h^{-1}(y) \varvec{T})^{-1} dF_\Theta (\theta ) \, dz\,\mathbf{e} \\&= \varvec{\pi }\mathcal {L}_\Theta ( - h^{-1}(y) \varvec{T}) \mathbf{e} \,, \end{aligned}$$

where we have used functional calculus to define the Laplace transform evaluated at a matrix. Taking derivatives in the expression above yields

$$\begin{aligned} f_{Y}(y)&= -\varvec{\pi }\int \theta \lambda (y) \varvec{T}\exp ({ \theta h^{-1}(y) \varvec{T}})\, dF_\Theta (\theta ) \mathbf{e} \end{aligned}$$

from which (2) follows.

The following lemma shows that Y has the same distribution as the transformation of a scaled PH distribution. Such a representation is useful for simulation and for estimation, as is apparent in later sections.

Lemma 3.1

Let Y be given in terms of (3). Then, \(Y {\mathop {=}\limits ^{d}}h(Z /\Theta )\), where \(Z \sim \text{ PH }(\varvec{\pi }, \varvec{T})\), independent of \(\Theta\), and \(h^{-1}(y)= \int _{0}^{y}\lambda (t)\ dt\).

Proof

$$\begin{aligned} {\mathbb P}(h(Z /\Theta )> y)&= \int {\mathbb P}(h(Z /\theta )> y \mid \Theta =\theta ) dF_{\Theta }(\theta ) \\&= \int {\mathbb P}( Z > \theta h^{-1}(y) \mid \Theta =\theta ) dF_{\Theta }(\theta ) \\&= \int \varvec{\pi }\exp ({ \theta h^{-1}(y) \varvec{T}}) \mathbf{e} \, dF_\Theta (\theta ) \\&= S_{Y}(y) \,. \end{aligned}$$

We now make the following formal definition of a random variable Y satisfying (3).

Definition 3.1

A random variable Y is said to have a scaled inhomogeneous phase-type (SIPH) distribution with representation \((\varvec{\pi }, \varvec{T}, \lambda )\) and scaling distribution \(F_\Theta\) if its survival function is given by

$$\begin{aligned} S_Y(y) = \varvec{\pi }\mathcal {L}_\Theta \left( - \int _0^y \lambda (t) dt \,\varvec{T}\right) \mathbf{e} , \quad y\ge 0. \end{aligned}$$

We write \(\text{ SIPH }(\varvec{\pi }, \varvec{T}, \lambda , \Theta )\).
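The representation \(Y {\mathop {=}\limits ^{d}}h(Z /\Theta )\) of Lemma 3.1 suggests a direct simulation scheme: draw the absorption time Z of the underlying Markov jump process, draw \(\Theta\) independently, and transform. The sketch below (illustrative parameters, not from the paper) does this for Gamma scaling and \(h(z)=z^{1/\eta }\), which by Example 3.4 below produces Matrix-Burr samples.

```python
# Simulation sketch of Y = h(Z / Theta) (Lemma 3.1), with illustrative
# parameters not taken from the paper: Z ~ PH(pi, T) is drawn by simulating the
# underlying Markov jump process, Theta ~ Gamma(alpha, 1) independently, and
# h(z) = z**(1/eta) is the matrix-Weibull transform.
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.6, 0.4])
T = np.array([[-2.0, 1.0],
              [0.5, -1.5]])
exit_rates = -T @ np.ones(2)   # exit rate vector t = -T e
eta, alpha = 2.0, 1.5

def simulate_ph():
    # absorption time of the (homogeneous) Markov jump process
    state = rng.choice(2, p=pi)
    time = 0.0
    while True:
        rate = -T[state, state]
        time += rng.exponential(1.0 / rate)
        # jump to another transient state or to the absorbing state
        probs = np.append(np.where(np.arange(2) == state, 0.0, T[state]) / rate,
                          exit_rates[state] / rate)
        nxt = rng.choice(3, p=probs)
        if nxt == 2:               # absorption
            return time
        state = nxt

n = 10_000
theta = rng.gamma(alpha, 1.0, size=n)
z = np.array([simulate_ph() for _ in range(n)])
y = (z / theta) ** (1.0 / eta)     # Y = h(Z / Theta)
```

The empirical survival function of the sample can then be compared against the closed form \(S_Y(y) = \varvec{\pi }( \varvec{I} - h^{-1}(y) \varvec{T})^{-\alpha } \mathbf{e}\) from Example 3.4.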

Remark 3.1

(Existing special cases of heavy-tailed PH models).

  1. i)

    For \(\lambda \equiv 1\) and \(\Theta \in \mathbb {N}\), almost surely, we obtain the class of NPH distributions introduced in Bladt et al. (2015), while for \(\lambda \equiv 1\) and \(\Theta \in \mathbb {R}_+\), almost surely, we recover the CPH class in Albrecher et al. (2021a); Rojas-Nandayapa and Xie (2018).

  2. ii)

    Consider a matrix Mittag-Leffler (fractional phase-type) random variable \(Y \sim \text{ MML }(\alpha , \varvec{\pi }, \varvec{T})\) as introduced in Albrecher et al. (2020a). Then, it can be shown that

    $$\begin{aligned} Y \ {\mathop {=}\limits ^{d}} \ Z^{1/\alpha } S_{\alpha } = (Z S_{\alpha }^{\alpha } )^{1/\alpha } \,, \end{aligned}$$

    where \(Z \sim \text{ PH }(\varvec{\pi }, \varvec{T})\) and \(S_{\alpha }\) is an independent (positive stable) random variable with Laplace transform given by \(\exp (-u^{\alpha })\), \(\alpha \in (0,1]\). Hence, we have that Y is SIPH distributed with \(h(x) = x^{1/\alpha }\) and \(\Theta = 1/S_{\alpha }^{\alpha }\). This class of distributions is the time-fractional counterpart of PH distributions and can be interpreted as absorption times of a stochastic process that traverses through a finite number of states. The holding times of the latter are Mittag-Leffler distributed, which are regularly varying, and thus can possess abnormally large holding times compared to a Markov framework. However, the boundary case \(\alpha =1\) corresponds to the usual exponential holding times, and thus there is a regime-shift with respect to tail behavior.

  3. iii)

    When the scaling component \(\Theta\) degenerates to a point \(\Theta \equiv k\in \mathbb {R}_+\), we recover the class of IPH distributions. This also implies that the class of SIPH distributions, with a given and fixed intensity, is dense in the class of distributions on the positive real line. The argument is omitted, but it is a simple application of convergence through the diagonal of an array, for instance, by choosing a sequence of scalings \(\Theta _n\) with constant mean k and variances shrinking to zero.

Remark 3.2

Recall that for a continuous and positive random variable Y, the hazard function \(\mu _Y\) is given by

$$\begin{aligned} \mu _Y(t) = \frac{f_Y(t)}{S_Y(t)}, \quad t\ge 0. \end{aligned}$$

Sometimes, it is convenient to deal with the cumulative or integrated hazard function \(M_Y\), which is given by

$$\begin{aligned} M_Y(t) = \int _0^{t} \mu _Y (s) ds = -\log (S_Y(t)), \quad t\ge 0. \end{aligned}$$

The classical frailty model in survival analysis assumes that the hazard function of an individual depends on an unobservable random variable \(\Theta\). More specifically, it assumes that \(\Theta\) acts multiplicatively on a baseline hazard function \(\mu\), that is

$$\begin{aligned} \mu (t; \Theta ) = \Theta \mu (t), \quad t\ge 0. \end{aligned}$$
(4)

Here, the random variable \(\Theta\) is known as the frailty. If we denote by Y the random variable with the above hazard, then the survival function of \(Y \mid \Theta = \theta\) is given by

$$\begin{aligned} S_{Y|\Theta }(y | \theta ) = \exp \left( -\theta \int _0^y \mu (t)dt \right) = \exp \left( -\theta M(y)\right) \,. \end{aligned}$$

Thus, the unconditional survival function of Y is given by

$$\begin{aligned} S_Y(y) = \int _0^\infty S_{Y|\Theta }(y | \theta ) dF_\Theta (\theta ) = \int _0^\infty \exp \left( -\theta M(y)\right) dF_\Theta (\theta ) = \mathcal {L}_\Theta (M(y)) \,. \end{aligned}$$

Furthermore, model (4) can incorporate covariates \(\mathbf{X} = (X_1,\dots , X_{q})^{\top }\in \mathbb {R}^q\) in a similar way to Cox's proportional hazards model via

$$\begin{aligned} \mu (t; \Theta , \mathbf{X} ) = \Theta \mu (t) \exp ( \varvec{\beta }\mathbf{X} ), \quad t\ge 0, \end{aligned}$$

where \(\varvec{\beta }\in \mathbb {R}^q\) is a q-dimensional parameter row vector. Note that when the frailty degenerates to \(\Theta \equiv 1\), one recovers the proportional hazards model, so that the frailty model generalizes the proportional hazards model. Commonly employed frailty distributions include the Gamma and the positive stable distributions, among others.

In Albrecher et al. (2021b), it was shown that the intensity function of an IPH distribution is asymptotically equivalent to its hazard function. More specifically, we have that \(\lambda (t) \sim C \mu (t)\) as \(t \rightarrow \infty\), where \(C>0\) is a constant. In particular, when \(p = 1\), the previous asymptotic result becomes an equality. It follows that the frailty model is a special case of our more general matrix specification of SIPH distributions, when \(p=1\).
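In the scalar case \(p=1\), the frailty identity \(S_Y(y)=\mathcal {L}_\Theta (M(y))\) derived above can be verified numerically; the sketch below (with an illustrative Weibull-type baseline hazard and a Gamma frailty, neither prescribed by the paper) compares the mixture integral against the closed-form Laplace transform.

```python
# Scalar (p = 1) sanity check of the frailty identity S_Y(y) = L_Theta(M(y)),
# with an illustrative Weibull-type cumulative baseline hazard M(y) = y^1.5
# and a Gamma(alpha, 1) frailty; neither choice is prescribed by the paper.
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

alpha = 2.0
M = lambda y: y**1.5               # cumulative baseline hazard M(y)

def survival_mixture(y):
    # integrate exp(-theta M(y)) against the Gamma(alpha, 1) density
    integrand = lambda th: np.exp(-th * M(y)) * gamma.pdf(th, alpha)
    return quad(integrand, 0.0, np.inf)[0]

def survival_laplace(y):
    # Gamma(alpha, 1) Laplace transform: L(u) = (1 + u)^(-alpha)
    return (1.0 + M(y)) ** (-alpha)

print(survival_mixture(1.0), survival_laplace(1.0))  # both equal 0.25
```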

Remark 3.3

(Incorporating regressors). As in the frailty model, we can introduce covariates into (2) via

$$\begin{aligned} \lambda (t ; \Theta , \ \mathbf{X} ) = \Theta \lambda (t) \exp ( \mathbf \beta \mathbf{X} ),\quad t\ge 0. \end{aligned}$$

In this case, we write \(Y \sim \text{ SIPH }(\varvec{\pi }, \varvec{T}, \lambda , \Theta , \varvec{\beta })\) to denote a random variable with the above intensity. Note that the proportional intensities model introduced in Albrecher et al. (2021b) is retrieved if the scaling distribution degenerates to \(\Theta \equiv 1\) for all individuals. Consequently, the SIPH model is a generalization of the proportional intensities model.

In what follows, we mostly restrict ourselves to the model (2) without covariates, the extension being straightforward but somewhat distracting to the current train of thought. Moreover, we assume that \(\Theta\) is a continuous random variable unless stated otherwise.

3.1 Novel examples

Next, we present a suite of new examples that arise naturally as matrix extensions of some well-known frailty models, providing along the way some insight into the precise asymptotic behavior of the proposed models. In Appendix 1, the definitions of the different classes of heavy-tailed distributions are provided.

Example 3.4

(Gamma Scaling). Consider \(\Theta \sim \text{ Gamma }(\alpha , 1)\), \(\alpha >0\), with Laplace transform

$$\begin{aligned} \mathcal {L}_\Theta (u) = (1 + u)^{-\alpha }, \quad u\ge -1. \end{aligned}$$

Then, the survival function \(S_{Y}\) of Y is given by

$$\begin{aligned} S_{Y}(y) = \varvec{\pi }( \varvec{I} - h^{-1}(y) \varvec{T})^{-\alpha } \mathbf{e} , \quad y\ge 0. \end{aligned}$$

As for the matrix-Pareto type II laws introduced in Albrecher et al. (2021a), taking the more general \(\Theta \sim \text{ Gamma }(\alpha , \gamma )\), \(\gamma >0\), results in the same class of distributions. For this reason, we work only with \(\text{ Gamma }(\alpha , 1)\). Consider now the particular case \(\lambda (y) = \eta y^{\eta - 1}\), \(\eta >0\). Then

$$\begin{aligned} S_{Y}(y) = \varvec{\pi }( \varvec{I} - y^{\eta } \varvec{T})^{-\alpha } \mathbf{e} \,. \end{aligned}$$

We call this the Matrix-Burr distribution.

Regarding the asymptotic behavior, we have that

$$\begin{aligned} S_{Y}(y) \sim C (h^{-1}(y))^{-\alpha } \,, \end{aligned}$$

where C is a positive constant, which follows from an eigenvalue decomposition of \(\varvec{T}\). The first-order precise asymptotics for the different intensities from Table 1 are provided in Table 2, where D, b, and c denote positive real-valued constants, which may change between intensities, but we write the same symbol for display purposes. Throughout the rest of this section, we use the same notational convention.

Table 2 Asymptotics for Gamma scaling
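A numerical sketch of the Matrix-Burr survival function (with illustrative parameters, not from the paper), together with a check of the regularly varying tail \(S_{Y}(y) \sim C y^{-\eta \alpha }\): the product \(S_Y(y)\,y^{\eta \alpha }\) should stabilize for large y.

```python
# Sketch (illustrative parameters) of the Matrix-Burr survival function
# S_Y(y) = pi (I - y^eta T)^(-alpha) e, together with a numerical look at its
# regularly varying tail: S_Y(y) * y^(eta*alpha) should stabilize for large y.
import numpy as np
from scipy.linalg import fractional_matrix_power

pi = np.array([0.3, 0.7])
T = np.array([[-1.0, 0.5],
              [0.2, -2.0]])
alpha, eta = 1.5, 2.0

def matrix_burr_survival(y):
    A = np.eye(2) - y**eta * T
    return float(np.real(pi @ fractional_matrix_power(A, -alpha) @ np.ones(2)))

for y in (10.0, 100.0, 1000.0):
    print(y, matrix_burr_survival(y) * y**(eta * alpha))
```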

Example 3.5

(Positive stable scaling). Consider \(\Theta\) positive stable with stability parameter \(\alpha \in (0,1]\). Then

$$\begin{aligned} S_{Y}(y) = \varvec{\pi }\exp ( - (-\varvec{T})^{\alpha } (h^{-1}(y))^{\alpha }) \mathbf{e} , \quad y\ge 0. \end{aligned}$$

As a particular case, take \(\lambda (y) = \eta y^{\eta - 1}\), \(\eta >0\). Then

$$\begin{aligned} S_{Y}(y) = \varvec{\pi }\exp ( - (-\varvec{T})^{\alpha } y^{\eta \alpha }) \mathbf{e} \,. \end{aligned}$$

It was noted in Albrecher et al. (2021a) that \((\varvec{\pi }, - (-\varvec{T})^{\alpha })\) is a PH representation. Thus, some simple calculations show that these distributions span the same class as the matrix-Weibull laws introduced in Albrecher and Bladt (2019). This is in contrast to the class of CPH distributions with stable mixing in Albrecher et al. (2021a), which only span the matrix-Weibull laws with \(\eta \in (0,1)\).

Regarding their asymptotic behavior, we have

$$\begin{aligned} S_{Y}(y) \sim C \exp (-b (h^{-1}(y))^{\alpha }) \,. \end{aligned}$$

Table 3 gives the precise asymptotics for the different intensities of Table 1.

Table 3 Asymptotics for positive stable scaling
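The observation that \((\varvec{\pi }, - (-\varvec{T})^{\alpha })\) is again a PH representation can be inspected numerically; in the sketch below (illustrative parameters, not from the paper), the transformed matrix has nonnegative off-diagonal entries and nonpositive row sums, i.e., it is a valid sub-intensity matrix.

```python
# Numerical inspection (illustrative pi, T, alpha, eta) of the observation that
# (pi, -(-T)^alpha) is again a phase-type representation: the transformed
# matrix should be a valid sub-intensity, and the positive stable scaled
# survival function is then of matrix-Weibull type.
import numpy as np
from scipy.linalg import expm, fractional_matrix_power

pi = np.array([0.5, 0.5])
T = np.array([[-2.0, 1.0],
              [0.0, -3.0]])
alpha, eta = 0.5, 2.0

T_alpha = -np.real(fractional_matrix_power(-T, alpha))   # -(-T)^alpha

def survival(y):
    # S_Y(y) = pi exp(-(-T)^alpha (y^eta)^alpha) e
    return pi @ expm((y**eta) ** alpha * T_alpha) @ np.ones(2)

print(T_alpha)  # nonnegative off-diagonal entries, nonpositive row sums
```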

Example 3.6

(Inverse Gaussian scaling). Consider inverse Gaussian scaling with parameters \(\nu >0\) and \(\eta >0\) and density

$$\begin{aligned} f_\Theta (\theta ) = \frac{\sqrt{\eta }}{\sqrt{2 \pi \theta ^3}} \exp \left( -\frac{\eta }{2 \nu ^2 \theta } (\theta - \nu )^2 \right) ,\quad \theta >0. \end{aligned}$$

Then, the corresponding Laplace transform of \(\Theta\) is given by

$$\begin{aligned} \mathcal {L}_\Theta (u) = \exp \left( -\frac{\eta \sqrt{1 + 2 \nu ^2 u / \eta }}{\nu } + \frac{\eta }{\nu } \right) ,\quad u\ge 0. \end{aligned}$$

We take the particular case \(\nu = 1\) and \(\sigma ^2 = 1/\eta\). In this way

$$\begin{aligned} \mathcal {L}_\Theta (u) = \exp \left( \frac{1}{\sigma ^2} \left( 1 - \sqrt{1 + 2 \sigma ^2 u }\right) \right) \,. \end{aligned}$$

Thus,

$$\begin{aligned} S_Y(y) = \varvec{\pi }\exp \left( \frac{1}{\sigma ^2} \left( \varvec{I} - \sqrt{\varvec{I} - 2 \sigma ^2 h^{-1}(y) \varvec{T}}\right) \right) \mathbf{e} , \quad y\ge 0. \end{aligned}$$

Regarding the asymptotic behavior, we have that

$$\begin{aligned} S_{Y}(y) \sim C \exp ( -b (h^{-1}(y))^{1/2}) \,. \end{aligned}$$

Table 4 gives the precise asymptotics for the different intensities of Table 1.

Table 4 Asymptotics for inverse Gaussian scaling
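Evaluating the inverse Gaussian scaled survival function only requires a matrix square root and a matrix exponential; a sketch with illustrative parameters (and \(\lambda \equiv 1\), so that \(h^{-1}(y)=y\)) follows.

```python
# Sketch (illustrative parameters, lambda(t) = 1) of the inverse Gaussian
# scaled survival function, via a matrix square root and matrix exponential.
import numpy as np
from scipy.linalg import expm, sqrtm

pi = np.array([0.4, 0.6])
T = np.array([[-1.0, 0.3],
              [0.5, -2.0]])
sigma2 = 0.5                     # sigma^2 = 1/eta, with nu = 1

def survival(y):
    # S_Y(y) = pi exp((1/sigma^2)(I - sqrt(I - 2 sigma^2 h^{-1}(y) T))) e
    inner = np.eye(2) - np.real(sqrtm(np.eye(2) - 2.0 * sigma2 * y * T))
    return pi @ expm(inner / sigma2) @ np.ones(2)

print(survival(0.0))  # 1.0
```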

Example 3.7

(PVF scaling). Consider the family of power variance function (PVF) distributions with Laplace transform

$$\begin{aligned} \mathcal {L}_{\Theta }(u) = \exp \left( \frac{\eta (1-\gamma )}{\gamma } \left( 1- \left( 1 + \frac{\nu u }{\eta (1- \gamma )} \right) ^{\gamma } \right) \right) , \quad u\ge 0, \end{aligned}$$

where \(\nu >0\), \(\eta >0\) and \(0 < \gamma \le 1.\) This family includes the Gamma, inverse Gaussian and the positive stable distributions as particular cases. Here we assume that \(\nu = 1\), which results in

$$\begin{aligned} S_Y(y) = \varvec{\pi }\exp \left( \frac{\eta (1-\gamma )}{\gamma }\left( \varvec{I} - \left( {\varvec{I} - \frac{h^{-1}(y)}{\eta (1 - \gamma )} \varvec{T}}\right) ^{\gamma }\right) \right) \mathbf{e} ,\quad y\ge 0. \end{aligned}$$

Regarding the asymptotic behavior, we have that

$$\begin{aligned} S_{Y}(y) \sim C \exp (-b(h^{-1}(y))^{\gamma }) \,, \end{aligned}$$

which results in the same asymptotics of Table 3 for the positive stable case, but with \(\alpha\) replaced by \(\gamma\).

Example 3.8

(Compound Poisson scaling). Consider a compound model \(\Theta = \sum _{i = 1}^N V_i\) with \(V_1, V_2, \dots\) i.i.d. random variables independent of N. In general, the Laplace transform of \(\Theta\) is given by

$$\begin{aligned} \mathcal {L}_\Theta (u) = \mathcal {L}_N (- \log \mathcal {L}_V(u)),\quad u\ge 0. \end{aligned}$$

In particular, for \(V\sim \text{ Gamma }(\alpha , 1)\) and \(N \sim \text{ Poisson }(\rho )\), we obtain

$$\begin{aligned} \mathcal {L}_\Theta (u) = \exp \left( - \rho \left( 1- \left( {1 + u} \right) ^{-\alpha } \right) \right) \,. \end{aligned}$$

Thus,

$$\begin{aligned} S_Y(y) = \varvec{\pi }\exp \left( - \rho \left( \varvec{I}- \left( \varvec{I} - h^{-1}(y) \varvec{T}\right) ^{-\alpha } \right) \right) \mathbf{e} , \quad y\ge 0. \end{aligned}$$

Note that this distribution has an atom at infinity with probability \(\exp (-\rho )\), corresponding to \({\mathbb P}(N= 0)\). In survival analysis terms, this means that with this probability an individual may never experience the event of interest. Considering \(N+1\) instead of N removes this atom.
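The atom at infinity can be observed numerically: in the sketch below (illustrative parameters, \(\lambda \equiv 1\)), the survival function levels off at \(\exp (-\rho )\) instead of decaying to zero.

```python
# Numerical look (illustrative parameters, lambda(t) = 1) at the atom at
# infinity of the compound Poisson scaled model: S_Y(y) levels off at
# exp(-rho) instead of decaying to zero.
import numpy as np
from scipy.linalg import expm, fractional_matrix_power

pi = np.array([0.5, 0.5])
T = np.array([[-1.0, 0.4],
              [0.2, -1.5]])
rho, alpha = 0.8, 1.2

def survival(y):
    # S_Y(y) = pi exp(-rho (I - (I - h^{-1}(y) T)^(-alpha))) e
    inner = np.real(fractional_matrix_power(np.eye(2) - y * T, -alpha))
    return pi @ expm(-rho * (np.eye(2) - inner)) @ np.ones(2)

print(survival(1e6), np.exp(-rho))  # essentially equal: the defective mass
```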

Example 3.9

(Discrete scaling). Assume that \(\Theta\) is a discrete random variable taking values in \(\{\eta _1, \eta _2, \dots \}\subset \mathbb {R}_+\) with corresponding probabilities \(\varvec{\alpha }= (\alpha _1, \alpha _2, \dots )\), that is, \({\mathbb P}(\Theta = \eta _i ) = \alpha _i\), \(i=1,2,\dots\). Then,

$$\begin{aligned} S_Y(y) = \sum _i \alpha _i \varvec{\pi }\exp \left( \eta _i \varvec{T}h^{-1}(y) \right) \mathbf{e} , \quad y\ge 0. \end{aligned}$$

Define the linear transformation \(\tilde{\varvec{T}}\) on \(\mathbb {R}^\mathbb {N}\) given by

$$\begin{aligned} \tilde{\varvec{T}} = \begin{pmatrix} \varvec{T}\eta _1 &{} {\mathbf{0}}&{} \cdots \\ {\mathbf{0}} &{}\varvec{T}\eta _2 &{} \cdots \\ \vdots &{} \vdots &{} \ddots \end{pmatrix}. \end{aligned}$$

Then, we can rewrite the survival function of Y as

$$\begin{aligned} S_Y(y) = (\varvec{\alpha }\otimes \varvec{\pi }) \exp \left( \tilde{\varvec{T}} h^{-1}(y) \right) \tilde{\mathbf{e }}, \quad y\ge 0, \end{aligned}$$

where \(\otimes\) denotes the Kronecker product, and \(\tilde{\mathbf{e }}\) is a column vector of ones of appropriate dimension. This can be thought of as an infinite-dimensional IPH distribution. The case \(\lambda \equiv 1\) recovers the class of NPH distributions introduced in Bladt et al. (2015).
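When the support of \(\Theta\) is finite, the mixture sum and the Kronecker representation above can be compared directly; a small sanity check with illustrative parameters (\(\lambda \equiv 1\)) follows.

```python
# Sanity check (illustrative parameters, lambda(t) = 1) that the mixture form
# of the discretely scaled survival function agrees with its Kronecker /
# block-diagonal representation.
import numpy as np
from scipy.linalg import expm, block_diag

pi = np.array([0.4, 0.6])
T = np.array([[-1.0, 0.5],
              [0.3, -2.0]])
etas = np.array([0.5, 1.0, 2.0])     # support points of Theta
alph = np.array([0.2, 0.5, 0.3])     # P(Theta = eta_i)

def survival_sum(y):
    # sum_i alpha_i pi exp(eta_i T h^{-1}(y)) e, with h^{-1}(y) = y
    return sum(a * (pi @ expm(e * T * y) @ np.ones(2))
               for a, e in zip(alph, etas))

def survival_kron(y):
    # (alph ⊗ pi) exp(T_tilde h^{-1}(y)) e with block-diagonal T_tilde
    T_tilde = block_diag(*[e * T for e in etas])
    return np.kron(alph, pi) @ expm(T_tilde * y) @ np.ones(6)

print(abs(survival_sum(1.3) - survival_kron(1.3)))  # numerically zero
```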

Note that another approach to study the asymptotic behavior, and that is particularly convenient in the discrete scaling case, is to use the representation \(Y = h(Z/\Theta )\), so that

$$\begin{aligned} {\mathbb P}(Y>y ) = {\mathbb P}( Z/\Theta > h^{-1}(y)) = S_{Z/\Theta }(h^{-1}(y)) \,, \end{aligned}$$

and employ the asymptotics of \(Z/\Theta\). For instance, taking \(\Theta \sim \text{ Gamma }(\alpha , 1)\), we have that \(Z/\Theta\) is regularly varying with index \(\alpha\) (see Albrecher et al. (2021a) for details). This leads to the same asymptotic results as in Table 2 for the different choices of intensities \(\lambda\). For discrete scaling, we could take, for instance, \(\Theta\) with a Zeta distribution, leading to the same asymptotic results.

As a second case, take \(V := 1/\Theta\) with a Weibull-type tail, so that VZ has a Weibull-type tail with shape parameter in (0, 1) (see Rojas-Nandayapa and Xie (2018)). Thus, the asymptotic behavior for the different intensities resembles that in Table 3.
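The gamma-scaling claim can be verified numerically: for \(\Theta \sim \text{Gamma}(\alpha ,1)\), the survival function \(S_{Z/\Theta }(t) = \varvec{\pi }(\varvec{I} - t\varvec{T})^{-\alpha }\mathbf{e}\) is explicit, and regular variation with index \(\alpha\) implies \(S(2t)/S(t) \rightarrow 2^{-\alpha }\) as \(t \rightarrow \infty\). A sketch with arbitrary two-phase parameters:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

pi = np.array([0.5, 0.5])
T = np.array([[-1.0, 0.5],
              [0.2, -2.0]])
e = np.ones(2)
alpha = 1.5
I = np.eye(2)

def S(t):
    """P(Z/Theta > t) = pi (I - t T)^{-alpha} e for Theta ~ Gamma(alpha, 1)."""
    return float(np.real(pi @ fractional_matrix_power(I - t * T, -alpha) @ e))

# Regular variation with index alpha: the ratio S(2t)/S(t) tends to 2^{-alpha}.
for t in (10.0, 100.0, 1000.0):
    print(t, S(2 * t) / S(t))
```

The ratio stabilizes near \(2^{-\alpha }\) already for moderately large \(t\), consistent with the Pareto-type tails reported in Table 2.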

Example 3.10

(Missing covariates in the proportional intensities model). Consider the proportional intensities model (also known as PH regression) introduced in Albrecher et al. (2021b) with vectors of observed and unobserved covariates \(\mathbf{X} _1\) and \(\mathbf{X} _2\), respectively. Namely, the intensity is of the form

$$\begin{aligned} \lambda (t; \mathbf{X} _1, \mathbf{X} _2) = \lambda (t) \exp ( \varvec{\beta }_1 \mathbf{X} _1 + \varvec{\beta }_2 \mathbf{X} _2 ),\quad t\ge 0. \end{aligned}$$

Given that the vector \(\mathbf{X} _2\) is unknown, the model cannot be employed in practice. However, we can assume that

$$\begin{aligned} \Theta := \exp ( \varvec{\beta }_2 \mathbf{X} _2 ) \end{aligned}$$

is an unobserved random variable independent of \(\mathbf{X} _1\). In this way, the scaled intensity model can be employed to account for the effect of omitted covariates by considering a parametric model for \(\Theta\). Such an additional random component can thus help to account for additional variability in the data that a simpler model cannot explain.

3.2 Parameter estimation

In order to derive an EM algorithm for SIPH distributions, we first recall the corresponding algorithm for CPH distributions in Albrecher et al. (2021a) (see Bladt and Rojas-Nandayapa (2018) for the discrete scaling case). Consider an i.i.d. sample \(y_1, \dots , y_K\) from a CPH distributed random variable Y, which we will also denote by \(\mathbf{y}\). Here, we assume that the scaling component \(\Theta\) belongs to a parametric family depending on the parameter vector \(\varvec{\alpha }\) and denote by \(f_\Theta\) its corresponding density. We now make the following definitions. Let \(B_k\) be the number of times the underlying Markov jump process of Y starts in state k, \(N_{kl}\) the total number of transitions from state k to l until absorption, \(N_k\) the number of times that k was the last state to be visited before absorption, and finally, let \(Z_k\) be the cumulated time that the Markov jump process spent in state k. The detailed routine for the estimation of CPH distributions is given in Algorithm 1.
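The statistics \(B_k\), \(N_{kl}\), \(N_k\), \(Z_k\) are straightforward to accumulate from simulated paths, which is useful for testing an EM implementation against fully observed data. A minimal sketch with toy parameters (in the actual algorithm these statistics are of course replaced by their conditional expectations given the observed absorption times):

```python
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([0.7, 0.3])
T = np.array([[-3.0, 1.0],
              [0.5, -2.0]])
t_exit = -T @ np.ones(2)            # exit rates into the absorbing state
p = len(pi)

B = np.zeros(p); N = np.zeros((p, p)); Nabs = np.zeros(p); Z = np.zeros(p)

for _ in range(1000):
    k = rng.choice(p, p=pi)          # initial state
    B[k] += 1
    while True:
        rate = -T[k, k]
        Z[k] += rng.exponential(1.0 / rate)       # holding time in state k
        # jump probabilities: to the other transient states or to absorption
        probs = np.append(np.where(np.arange(p) == k, 0.0, T[k]), t_exit[k]) / rate
        j = rng.choice(p + 1, p=probs)
        if j == p:                   # absorbed; k was the last state visited
            Nabs[k] += 1
            break
        N[k, j] += 1
        k = j

# EM-style estimator of the initial distribution from the observed statistics
pi_hat = B / B.sum()
print(pi_hat, Z)
```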

[Algorithm 1]

We now derive a generalized EM algorithm for maximum-likelihood estimation of SIPH distributions. Assume that \(\lambda (\,\cdot \, ; \mathbf \eta )\ge 0\) is a nonnegative parametric function depending on the vector \(\mathbf \eta\). Let \(Y \sim \text{ SIPH }(\varvec{\pi }, \varvec{T}, \lambda (\,\cdot \, ; \mathbf \eta ), \Theta , \varvec{\beta })\), then

$$Y {\mathop {=}\limits ^{d}}h(\exp (-\varvec{\beta }\mathbf{X} ) Z /\Theta ; \mathbf \eta ),$$

where \(Z \sim \text{ PH }(\varvec{\pi }, \varvec{T})\). In particular, this implies that \(h^{-1}(Y; \mathbf \eta ) \exp (\varvec{\beta }\mathbf{X} ) {\mathop {=}\limits ^{d}}Z /\Theta\), meaning that \(h^{-1}(Y; \mathbf \eta ) \exp (\varvec{\beta }\mathbf{X} )\) is scaled PH distributed. Consider now an i.i.d. sample \(y_1,\dots ,y_K\) from this Y; then the EM algorithm for parameter estimation is as follows.

[Algorithm 2]

Proposition 3.2

Algorithm 2 increases the likelihood function at each iteration. Since for fixed p, the likelihood of SIPH distributions is bounded, convergence towards a (possibly local) maximum is guaranteed.

Proof

By the change of variable theorem, we have that

$$\begin{aligned} f_{Y}(y)&= f_{Z/\Theta }(h^{-1}(y;\mathbf \eta )\exp (\varvec{\beta }\mathbf{X} ) ; \mathbf \pi ,\varvec{T},\mathbf \alpha ) \lambda (y;\mathbf \eta ) \exp (\varvec{\beta }\mathbf{X} ) ,\quad y\ge 0. \end{aligned}$$

Consider parameter values \((\mathbf \pi _i,\varvec{T}_i,\mathbf \alpha _i,\mathbf \eta _i, \varvec{\beta }_i)\) after the i-th iteration. Then the data log-likelihood after the i-th iteration is given by

$$\begin{aligned} { l( \mathbf \pi _i,\varvec{T}_i,\mathbf \alpha _i,\mathbf \eta _i, \varvec{\beta }_i ; \mathbf{y} , \mathbf{X} )} =& \sum _{n = 1}^{K} \log ( f_{Z/\Theta }(h^{-1}(y_n;\mathbf \eta _i) \exp (\varvec{\beta }_i \mathbf{X} _n ) ;\mathbf \pi _i,\varvec{T}_i,\mathbf \alpha _i )) \\ &+ \log ( \lambda (y_n;\mathbf \eta _i) ) + \varvec{\beta }_i\mathbf{X} _n \,. \end{aligned}$$

In the \((i + 1)\)-th iteration, we first obtain \((\mathbf \pi _{i+1},\varvec{T}_{i+1},\mathbf \alpha _{i+1})\) in step 1, so that

$$\begin{aligned} {l( \mathbf \pi _i,\varvec{T}_i,\mathbf \alpha _i,\mathbf \eta _i, \varvec{\beta }_i ; \mathbf{y} , \mathbf{X} )} \le& \sum _{n = 1}^{K} \log ( f_{Z/\Theta }(h^{-1}(y_n;\mathbf \eta _i) \exp (\varvec{\beta }_i \mathbf{X} _n ) ; \ \mathbf \pi _{i+1},\varvec{T}_{i+1},\mathbf \alpha _{i+1} )) \\ &+ \log ( \lambda (y_n;\mathbf \eta _i))+ \varvec{\beta }_i \mathbf{X} _n \\ &= l( \mathbf \pi _{i+1},\varvec{T}_{i+1},\mathbf \alpha _{i+1},\mathbf \eta _{i}, \varvec{\beta }_{i} ; \mathbf{y} , \mathbf{X} ) \,. \end{aligned}$$

Finally, by step 2,

$$\begin{aligned} l( \mathbf \pi _i,\varvec{T}_i,\mathbf \alpha _i,\mathbf \eta _i, \varvec{\beta }_i ; \mathbf{y} , \mathbf{X} )&\le \max _{(\mathbf \eta , \varvec{\beta })} l( \mathbf \pi _{i+1},\varvec{T}_{i+1},\mathbf \alpha _{i+1},\mathbf \eta , \varvec{\beta }; \mathbf{y} , \mathbf{X} ) \\&= l( \mathbf \pi _{i+1},\varvec{T}_{i+1},\mathbf \alpha _{i+1},\mathbf \eta _{i+1}, \varvec{\beta }_{i+1} ; \mathbf{y} , \mathbf{X} ) \,. \end{aligned}$$

Remark 3.4

The optimization problem

$$\begin{aligned} arg\,max_{(\mathbf \eta , \varvec{\beta })} \sum _{n=1}^{K} \log (f_{Y}(y_n; \hat{\varvec{\pi }}, \hat{\varvec{T}}, \hat{\mathbf \alpha }, \mathbf \eta , \varvec{\beta })) \end{aligned}$$
(5)

of Algorithm 2 is computationally heavy. However, observe that a few iterations of any optimization routine suffice for the proof and conclusion of Proposition 3.2 to hold, and full convergence of (5) is not necessary. For instance, a single step of the optimization routine can already provide good results.

Remark 3.5

(Incorporating right-censoring). Algorithm 2 can be modified to work with censored data. We illustrate the changes by considering only the case of right-censoring since it is the most common scenario in survival analysis applications. However, left-censoring and interval-censoring can be treated by similar means. In such a case, we no longer observe \(Y = y\) but instead only that \(Y \in [v, \infty )\). By monotonicity of h, we have that \(h^{-1}(Y; \mathbf \eta ) \exp (\varvec{\beta }\mathbf{X} ) \in [h^{-1}(v; \mathbf \eta ) \exp (\varvec{\beta }\mathbf{X} ) , \infty )\), which can be interpreted as a censored observation of a scaled PH distributed random variable. Moreover, in Albrecher et al. (2021a) (and Bladt and Rojas-Nandayapa (2018)), a modified EM algorithm for the estimation of scaled PH distributions is presented for the case of censored observations. This means that the main change in Algorithm 2 is in step 2, where we must now compute

$$\begin{aligned} (\hat{\mathbf \eta }, \hat{\varvec{\beta }}) = arg\,max_{(\mathbf \eta , \varvec{\beta })}&\sum _{n \,:\, y_n \,\text {observed}}^{K} \log (f_{Y}(y_n; \hat{\varvec{\pi }}, \hat{\varvec{T}}, \hat{\mathbf \alpha }, \mathbf \eta , \varvec{\beta })) \\&+ \sum _{n \,:\, y_n \,\text {censored}}^{K} \log (S_{Y}(y_n; \hat{\varvec{\pi }}, \hat{\varvec{T}}, \hat{\mathbf \alpha }, \mathbf \eta , \varvec{\beta })) \,. \end{aligned}$$
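The structure of this split objective (densities for observed points, survival functions for censored ones) can be illustrated on a deliberately simple toy model: a one-phase PH, i.e. an exponential, with no scaling, where the right-censored MLE is available in closed form. The sketch below only demonstrates the observed/censored split, not the full SIPH estimation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.exponential(2.0, size=500)          # event times, true rate 0.5
c = rng.exponential(4.0, size=500)          # independent censoring times
obs = np.minimum(y, c)
observed = y <= c                           # True where the event was seen

def neg_loglik(rate):
    # observed points contribute log f, censored points contribute log S
    ll = np.sum(np.log(rate) - rate * obs[observed])   # log f = log(rate) - rate*y
    ll += np.sum(-rate * obs[~observed])               # log S = -rate*y
    return -ll

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
mle_closed_form = observed.sum() / obs.sum()           # classical censored-exponential MLE
print(res.x, mle_closed_form)
```

The numerical optimizer recovers the closed-form censored MLE, confirming that the split objective is the right quantity to maximize.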

3.3 Estimation for fractional PH distributions

A key distinction of the matrix Mittag-Leffler distribution (or fractional PH), with respect to the other models introduced in Section 3.1, is that the transformation \(h(x) = x^{1/\alpha }\) and the mixing distribution \(\Theta = 1/S_{\alpha }^{\alpha }\) depend on the same parameter \(\alpha\). This makes statistical estimation very challenging by ad-hoc methods, and thus embedding into the SIPH class is useful for this purpose. Note that for the previously presented models the transformation parameters are distinct from those of the scaling component, and this separation is the central assumption in the derivation of Algorithm 2. Thus, special care must be taken in the estimation of matrix Mittag-Leffler distributions when seen as SIPH distributions. This is solved by employing a modified EM algorithm, the details of which are given in Algorithm 3.

[Algorithm 3]

By the same method of proof as for Algorithm 2, one can show that Algorithm 3 increases the likelihood at each iteration; we omit the details for brevity.

4 Shared scaling

This section presents a multivariate extension of SIPH distributions, inspired by the construction principle of the shared frailty model. The key idea is to think of an underlying random variable which is a common scaling factor to all the coordinates of an independent random vector, creating dependency and heavy-tailedness all at once through the same mechanism.

4.1 A class of multivariate CPH distributions

Before going into full generality, we consider the case where there is no deterministic time-transform component. This allows for a more transparent treatment with explicit formulas. Thus, consider a random vector \(\mathbf{Y} =(Y_1,\dots ,Y_d)^{\top }\) whose coordinates are conditionally independent given \(\Theta =\theta\) and such that

$$\begin{aligned} Y_i \mid \Theta = \theta \sim \text{ PH }(\mathbf \pi _i,\theta \varvec{T}_i) \,, \quad i = 1,\dots ,d \,. \end{aligned}$$

Then, the joint survival function of \(\mathbf{Y}\) is given by

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} )&= \int {\mathbb P}(Y_1>y_1,\dots ,Y_d>y_d \mid \Theta =\theta )dF_\Theta (\theta ) \\&= \int \prod _{i=1}^d \mathbf \pi _i \exp \left( {\theta \varvec{T}_i y_i}\right) \mathbf{e} \, dF_\Theta (\theta ) \\&= \int (\mathbf \pi _1\otimes \cdots \otimes \mathbf \pi _d )\exp \left( {\theta (\varvec{T}_1y_1\oplus \cdots \oplus \varvec{T}_dy_d)}\right) \mathbf{e}\ dF_\Theta (\theta ) \\&= (\mathbf \pi _1\otimes \cdots \otimes \mathbf \pi _d ) \mathcal {L}_\Theta (-(\varvec{T}_1y_1\oplus \cdots \oplus \varvec{T}_d y_d)) \mathbf{e} , \quad y_i\ge 0 \,\, i=1,\dots ,d, \end{aligned}$$

where \(\oplus , \otimes\) denote the Kronecker sum and product, respectively. In particular, this yields the joint density

$$\begin{aligned} f_\mathbf{Y }(\mathbf{y} )&= (-1)^{d} (\mathbf \pi _1\otimes \cdots \otimes \mathbf \pi _d ) \mathcal {L}_\Theta ^{(d)}(-(\varvec{T}_1y_1\oplus \cdots \oplus \varvec{T}_d y_d)) \tilde{\mathbf{t }} ,\quad y_i\ge 0 \,\, i=1,\dots ,d, \end{aligned}$$

where \(\tilde{\mathbf{t }} = \mathbf{t}_1\otimes \cdots \otimes \mathbf{t}_d\) and \(\mathcal {L}_\Theta ^{(d)}(u)\) is the derivative of order d of \(\mathcal {L}_\Theta (u)\), which can again be shown by the use of functional calculus through Cauchy’s formula. Moreover, marginally we get continuously scaled PH behavior:

$$\begin{aligned} Y_i \sim \text{ CPH }(\mathbf \pi _i,\varvec{T}_i,\Theta ) \,, \quad i = 1, \dots , d \,. \end{aligned}$$

Alternatively, it is easy to see that \(\mathbf{Y}\) has representation \((Y_1, \dots , Y_d)^{\top } = (Z_1, \dots ,Z_d)^{\top }/ \Theta\), where \(Z_i\) are independent \(\text{ PH }(\mathbf \pi _i,\varvec{T}_i)\) distributed random variables independent of \(\Theta\), \(i =1, \dots , d\). Indeed,

$$\begin{aligned} {\mathbb P}(Y_1> y_1, \dots ,Y_d>y_d )&= \int {\mathbb P}(Y_1> y_1, \dots ,Y_d>y_d \mid \Theta = \theta ) dF_\Theta (\theta ) \\&= \int {\mathbb P}(Z_1> \theta y_1, \dots ,Z_d>\theta y_d \mid \Theta = \theta ) dF_\Theta (\theta ) \\&= \int \prod _{i=1}^d \mathbf \pi _i \exp \left( {\theta \varvec{T}_i y_i}\right) \mathbf{e} dF_\Theta (\theta ) \\&= S_\mathbf{Y }(\mathbf{y} ) \,. \end{aligned}$$
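The representation \(\mathbf{Y} = \mathbf{Z}/\Theta\) gives a direct simulation recipe: draw independent PH absorption times and divide them by a single shared draw of \(\Theta\). A sketch with made-up parameters and \(\Theta \sim \text{Gamma}(3,1)\) (the shape chosen so that \(1/\Theta\) has finite variance):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_ph(pi, T, rng):
    """Absorption time of a Markov jump process with sub-intensity matrix T."""
    p = len(pi)
    t_exit = -T @ np.ones(p)
    k = rng.choice(p, p=pi)
    time = 0.0
    while True:
        rate = -T[k, k]
        time += rng.exponential(1.0 / rate)
        probs = np.append(np.where(np.arange(p) == k, 0.0, T[k]), t_exit[k]) / rate
        j = rng.choice(p + 1, p=probs)
        if j == p:                   # absorption
            return time
        k = j

pi1, T1 = np.array([1.0]), np.array([[-1.0]])       # exponential marginal
pi2, T2 = np.array([0.5, 0.5]), np.array([[-2.0, 1.0], [0.0, -3.0]])
alpha = 3.0

# Y = (Z_1, Z_2) / Theta with one shared Theta ~ Gamma(alpha, 1) per observation
n = 2000
theta = rng.gamma(alpha, 1.0, size=n)
Y = np.array([[sample_ph(pi1, T1, rng),
               sample_ph(pi2, T2, rng)] for _ in range(n)]) / theta[:, None]
print(Y.shape, np.corrcoef(Y.T)[0, 1])
```

The common divisor \(\Theta\) is what induces positive dependence between the otherwise independent coordinates.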

These multivariate distributions were studied from another perspective in Furman et al. (2021), where the authors derived some properties in the context of risk management. We presently derive some probabilistic properties, provide an estimation method, and extend the class to allow for deterministic time transforms. In the next section we also allow for scaling of different components of the random vector by different (but correlated) scaling random variables. Since these distributions will be the building blocks of the more general time-inhomogeneous multivariate models presented in Section 4.3, a good understanding of the former facilitates the treatment of the latter.

Example 4.1

(Gamma scaling). Consider \(\Theta \sim \text{ Gamma }(\alpha , 1)\), \(\alpha >0\), then the joint survival function of \(\mathbf{Y}\) is given by

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} ) = (\mathbf \pi _1\otimes \cdots \otimes \mathbf \pi _d ) \left( \varvec{I}-(\varvec{T}_1 y_1\oplus \cdots \oplus \varvec{T}_d y_d)\right) ^{-\alpha } \mathbf{e} , \quad y_i\ge 0 \,\, i=1,\dots ,d. \end{aligned}$$

This distribution can be seen to be a matrix version of Mardia’s multivariate Pareto distribution (see Mardia (1962)).
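The matrix Pareto survival function can be evaluated with a fractional matrix power, since \(\varvec{I}-(\varvec{T}_1 y_1\oplus \varvec{T}_2 y_2)\) has all eigenvalues in the right half-plane. A bivariate sketch with illustrative parameters:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

pi1, T1 = np.array([1.0]), np.array([[-1.0]])
pi2, T2 = np.array([0.5, 0.5]), np.array([[-2.0, 1.0], [0.0, -3.0]])
alpha = 1.5

def kron_sum(A, B):
    """Kronecker sum A (+) B = A x I + I x B."""
    return np.kron(A, np.eye(len(B))) + np.kron(np.eye(len(A)), B)

def S(y1, y2):
    """Joint survival (pi1 x pi2)(I - (T1 y1 (+) T2 y2))^{-alpha} e."""
    pi = np.kron(pi1, pi2)
    M = np.eye(len(pi)) - kron_sum(T1 * y1, T2 * y2)
    return float(np.real(pi @ fractional_matrix_power(M, -alpha) @ np.ones(len(pi))))

print(S(0.0, 0.0))        # = 1, a valid survival function at the origin
print(S(1.0, 1.0))
```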

4.2 Parameter estimation: multivariate CPH distributions

We now present a generalized EM algorithm for maximum-likelihood estimation of the class of multivariate CPH distributions introduced previously. The complete data is the scaling component \(\Theta\) together with the conditionally independent Markov jump processes paths. We further assume that \(\Theta\) belongs to a parametric family depending on the vector \(\varvec{\alpha }\) and denote by \(f_\Theta\) its corresponding density.

Consider observations \(\mathbf{y} _n = (y_{n}^{(1)}, \dots , y_{n}^{(d)})^{\top }\), \(n =1 ,\dots , K\), from a multivariate CPH distributed random vector, and let \(\tilde{\mathbf{y }}\) denote the whole data set. We also denote by \(\tilde{\varvec{\pi }}\) and \(\tilde{\varvec{T}}\) the sets of parameters \(\{\varvec{\pi }_1, \dots , \varvec{\pi }_d\}\) and \(\{\varvec{T}_1,\dots , \varvec{T}_d\}\), respectively, and \(\pi _{k}^{(i)}\) and \(t_{kl}^{(i)}\) to refer to the entries of \(\varvec{\pi }_i\) and \(\varvec{T}_i\), \(i = 1, \dots , d\). In order to write down the complete likelihood \(L_c(\tilde{\varvec{\pi }}, \tilde{\varvec{T}},\mathbf \alpha ;\tilde{\mathbf{y }})\), we need the following definitions. For each \(i = 1, \dots ,d\), let \(B_k^i\) be the number of times the underlying Markov jump process of \(Y_i\) starts in state k, \(N_{kl}^i\) the total number of transitions from state k to l until absorption, \(N_k^i\) the number of times that k was the last state to be visited before absorption, and finally, let \(Z_k^i\) be the cumulated time that the Markov jump process spent in state k.

Then, the complete likelihood is given by

$$\begin{aligned}&L_c(\tilde{\varvec{\pi }}, \tilde{\varvec{T}}, \mathbf \alpha ; \tilde{\mathbf{y }} ) \\&\quad = f_\Theta (\theta ; \mathbf \alpha ) \prod _{i=1}^{d} \prod _{k=1}^{p_i}(\pi _{k}^{(i)})^{B_{k}^i}\prod _{k=1}^{p_i}\prod _{l=1, l\ne k}^{p_i} \left( \theta t_{kl}^{(i)}\right) ^{N_{kl}^i}\exp \big ( -\theta t_{kl}^{(i)}Z_{k}^i \big ) \\&\quad \quad \times \prod _{k=1}^{p_i}\left( \theta t_{k}^{(i)} \right) ^{N_{k}^i}\exp \big (-\theta t_{k}^{(i)} Z_{k}^i \big ) \,, \end{aligned}$$

with corresponding log-likelihood (discarding the terms which do not depend on any parameters)

$$\begin{aligned} { l_c(\tilde{\varvec{\pi }}, \tilde{\varvec{T}},\mathbf \alpha ; \tilde{\mathbf{y }}) }\\&= \sum _{i=1}^{d} \sum _{k=1}^{p_i} {B_{k}^i} \log \left( \pi _{k}^{(i)} \right) + \sum _{i=1}^{d} \sum _{k=1}^{p_i}\sum _{l=1, l\ne k}^{p_i} N_{kl}^i \log \left( t_{kl}^{(i)}\right) - \sum _{i=1}^{d} \sum _{k=1}^{p_i}\sum _{l=1, l\ne k}^{p_i}{t_{kl}^{(i)} \theta Z_{k}^{i} } \\&\quad + \sum _{i=1}^{d} \sum _{k=1}^{p_i} {N_{k}^i}\log \left( t_{k}^{(i)}\right) - \sum _{i=1}^{d} \sum _{k=1}^{p_i} {t_{k}^{(i)} \theta Z_{k}^i } + \log (f_\Theta (\theta ; \mathbf \alpha ) ) \,. \end{aligned}$$

Regarding the E-step, which consists of computing the conditional expectation of the log-likelihood given the observed data, the calculations are somewhat similar to those of Albrecher et al. (2021a). We illustrate the procedure by computing the conditional expectation of the logarithmic term. Consider one (generic) data point (\(K = 1\)) and let \(\mathbf{y} = \mathbf{y} _1\). Then

$$\begin{aligned} \mathbb {E}\left[ \log (f_\Theta (\Theta ; \mathbf \alpha ) ) \mid \mathbf{Y} =\mathbf{y} \right]&= \int _0^\infty \log (f_\Theta (\theta ; \mathbf \alpha ) ) f_{\Theta | \mathbf{Y} } (\theta | \mathbf{y} ) d\theta \\&= \int _0^\infty \log (f_\Theta (\theta ; \mathbf \alpha ) ) \frac{f_{\Theta , \mathbf{Y} } (\theta , \mathbf{y} )}{f_\mathbf{Y }(\mathbf{y} )} d\theta \\&= \int _0^\infty \log (f_\Theta (\theta ; \mathbf \alpha ) ) \frac{f_{ \mathbf{Y} | \Theta } ( \mathbf{y} | \theta ) f_\Theta (\theta )}{f_\mathbf{Y }(\mathbf{y} )} d\theta \\&= \int _0^\infty \log (f_\Theta (\theta ; \mathbf \alpha ) ) \frac{ \prod _{i =1 }^{d} \varvec{\pi }_i \exp ({ \theta \varvec{T}_i y^{(i)}}) \theta \mathbf{t} _i }{f_\mathbf{Y }(\mathbf{y} )} f_\Theta (\theta ) d\theta \,. \end{aligned}$$

The formulas for all the other statistics are derived by similar calculations.

Concerning the M-step, consisting of maximizing the conditional expected log-likelihood in terms of the parameters, for the parameter \(\varvec{\alpha }\) of the scaling component we have in full generality

$$\begin{aligned} \hat{\mathbf \alpha }&= arg\,max_{\mathbf \alpha } \mathbb {E}\left( \log (f_\Theta (\Theta ; \mathbf \alpha ) ) \mid \tilde{\mathbf{Y }} = \tilde{\mathbf{y }} \right) \,. \end{aligned}$$

Regarding the PH component’s parameters, the entries of the sub-intensity matrix can be found by direct differentiation of the log-likelihood, while for the vector of initial probabilities, we can employ a Lagrange multiplier argument. We omit further details for brevity. We summarize the complete procedure in Algorithm 4.

[Algorithm 4]

4.3 A class of multivariate SIPH distributions

We now proceed to incorporate deterministic time-inhomogeneity into the shared scaling construction. Consider random variables \((Y_1,\dots ,Y_d)^{\top }\) that are conditionally independent given \(\Theta =\theta\), with

$$\begin{aligned} Y_i \mid \Theta = \theta \sim \text{ IPH }(\mathbf \pi _i,\varvec{T}_i, \theta \lambda _i) \,, \quad i = 1, \dots , d \,. \end{aligned}$$

Then

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} )&=\int {\mathbb P}(Y_1>y_1,\dots ,Y_d>y_d \mid \Theta =\theta )dF_\Theta (\theta ) \\&= \int \prod _{i=1}^d \mathbf \pi _i \exp \left( {\theta \varvec{T}_i h_i^{-1}(y_i)}\right) \mathbf{e} \, dF_\Theta (\theta ) \\&= \int (\mathbf \pi _1\otimes \cdots \otimes \mathbf \pi _d ) \exp \left( {\theta (\varvec{T}_1 h_1^{-1}(y_1)\oplus \cdots \oplus \varvec{T}_d h_d^{-1}(y_d))}\right) \mathbf{e} \, dF_\Theta (\theta ) \\&= (\mathbf \pi _1\otimes \cdots \otimes \mathbf \pi _d ) \mathcal {L}_\Theta (-(\varvec{T}_1 h_1^{-1}(y_1)\oplus \cdots \oplus \varvec{T}_d h_d^{-1}(y_d))) \mathbf{e} , \,\,y_i\ge 0,\,\, i=1,\dots ,d, \end{aligned}$$

and

$$\begin{aligned} f_\mathbf{Y }(\mathbf{y} )&= (-1)^{d} \left( \prod _{i =1 }^{d} \lambda _i(y_i) \right) (\mathbf \pi _1\otimes \cdots \otimes \mathbf \pi _d ) \mathcal {L}_{\Theta }^{(d)}(-(\varvec{T}_1 h_1^{-1}(y_1)\oplus \cdots \oplus \varvec{T}_d h_d^{-1}(y_d))) \tilde{\mathbf{t }} \,, \end{aligned}$$

where \(h^{-1}_i (y) = \int _0^{y} \lambda _i(t) dt\), \(i = 1, \dots , d\). Note that \(\mathbf{Y}\) has representation \((Y_1, \dots ,Y_d)^{\top } = (h_1(Z_1/\Theta ), \dots ,h_d(Z_d/\Theta ))^{\top }\), which can be seen as follows

$$\begin{aligned} {\mathbb P}(Y_1> y_1, \dots ,Y_d>y_d )&= \int {\mathbb P}(Y_1> y_1, \dots ,Y_d>y_d \mid \Theta = \theta ) dF_\Theta (\theta ) \\&= \int {\mathbb P}(Z_1> \theta h_1^{-1}(y_1), \dots ,Z_d>\theta h_d^{-1}(y_d) \mid \Theta = \theta ) dF_\Theta (\theta ) \\&= \int \prod _{i=1}^d \mathbf \pi _i \exp \left( {\theta \varvec{T}_i h_i^{-1}(y_i)}\right) \mathbf{e} dF_\Theta (\theta ) \\&= S_\mathbf{Y }(\mathbf{y} ) \,. \end{aligned}$$

Example 4.2

(Positive stable scaling). Take \(\Theta\) positive stable with stability parameter \(\alpha \in (0, 1]\), then

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} ) = (\mathbf \pi _1\otimes \cdots \otimes \mathbf \pi _d ) \exp \left( -(-\varvec{T}_1 h_1^{-1}(y_1)\oplus \cdots \oplus \varvec{T}_d h_d^{-1}(y_d))^{\alpha }\right) \mathbf{e} \,. \end{aligned}$$

For the particular case \(\lambda _i(y) \equiv \eta _i y^{\eta _i - 1}\), \(\eta _i>0\), \(i = 1, \dots , d\), we have

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} ) = (\mathbf \pi _1\otimes \cdots \otimes \mathbf \pi _d ) \exp \left( -(-\varvec{T}_1 y_1^{\eta _1}\oplus \cdots \oplus \varvec{T}_d y_d^{\eta _d})^{\alpha }\right) \mathbf{e} \,. \end{aligned}$$

This joint distribution can be seen to be a matrix-parameter version of the multivariate Weibull distribution introduced in Manatunga and Oakes (1999).

Remark 4.1

Covariates can be incorporated into the model by assuming that the intensities are of the form

$$\begin{aligned} \lambda _i(t ; \Theta , \mathbf{X} ) = \Theta \lambda _i(t) \exp (\mathbf \beta \mathbf{X} ), \quad t\ge 0, \quad i=1,\dots ,d. \end{aligned}$$

Remark 4.2

(Shared frailty model). In the shared frailty model, it is assumed that a group of individuals is conditionally independent given the frailty. In this way, the conditional joint survival function of \(\mathbf{Y} \mid \Theta =\theta\), \(\mathbf{Y} = (Y_1, \dots , Y_d)^{\top }\), is given by

$$\begin{aligned} S_\mathbf{Y | \Theta }(\mathbf{y} | \theta )&= {\mathbb P}(Y_1>y_1,\dots ,Y_d>y_d \mid \Theta = \theta ) \\&=\prod _{i = 1 }^{d} \exp (- \theta M_i(y_i)) \\&= \exp \left( -\theta \sum _{i =1}^{d} M_i(y_i) \right) \,, \end{aligned}$$

where \(M_i\) are baseline cumulative hazards, \(i = 1,\dots , d\). Thus, the joint survival function of \(\mathbf{Y}\) is given by

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} ) = \mathcal {L}_\Theta \left( \sum _{i =1}^{d} M_i(y_i) \right) \,. \end{aligned}$$

Using that

$$\begin{aligned} M_i(y) = \mathcal {L}_\Theta ^{-1}(S_{Y_i}(y)) \,, \quad i = 1, \dots , d , \end{aligned}$$

the above joint survival function can be rewritten as

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} ) = \mathcal {L}_\Theta \left( \sum _{i =1}^{d} \mathcal {L}_\Theta ^{-1}(S_{Y_i}(y_i)) \right) \,. \end{aligned}$$

In particular, this means that the survival copula of \(\mathbf{Y}\) is an Archimedean copula. Note that the shared frailty model is a particular case of the class of multivariate SIPH distributions introduced here when \(p = 1\).
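For instance, taking \(\Theta \sim \text{ Gamma }(\alpha , 1)\), the Laplace transform and its inverse are explicit, and the formula above specializes to

```latex
\mathcal{L}_\Theta (u) = (1+u)^{-\alpha }, \qquad
\mathcal{L}_\Theta ^{-1}(s) = s^{-1/\alpha } - 1 ,
\qquad\text{so that}\qquad
S_{\mathbf{Y}}(\mathbf{y})
  = \Big( \sum _{i=1}^{d} S_{Y_i}(y_i)^{-1/\alpha } - d + 1 \Big)^{-\alpha } ,
```

i.e., the survival copula is the Clayton copula with parameter \(1/\alpha\).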

We now study the dependence structure of multivariate SIPH distributions. When \(p = 1\), the survival copula of \(\mathbf{Y}\) is an Archimedean copula. To study the more general case, note that all the transformations presented in Table 1 are strictly increasing. This means that the copulas of models based on these intensities are the same as those of the models presented in Section 4.1, and thus it suffices to study the latter. Define the coefficient of upper tail dependence as

$$\begin{aligned} \lambda _U (\mathbf{Y} ) = \lim _{q \rightarrow 1^{-}} {{\mathbb P}(Y_1> F_{Y_1}^{\leftarrow }(q) \mid Y_2 > F_{Y_2}^{\leftarrow }(q)) } \,. \end{aligned}$$

Proposition 4.1

Let \(V := 1/ \Theta\) be regularly varying with index \(\alpha >0\). Then

$$\begin{aligned} \lambda _U (\mathbf{Y} )&= \Gamma (\alpha + 1) (\varvec{\pi }_1 \otimes \varvec{\pi }_2)(-\tilde{\varvec{T}}_1\oplus \tilde{\varvec{T}}_2 )^{-\alpha } \mathbf{e} \, , \end{aligned}$$

where \(\tilde{\varvec{T}}_i := \varvec{T}_i \mathbb {E}(Z_i^\alpha )^{1/\alpha }\), \(i =1,2\).

Proof

Given the definition of our model, Proposition 1 of Section 2 in Engelke et al. (2019) yields

$$\begin{aligned} \lambda _U (\mathbf{Y} ) = \mathbb {E}\left( \min \left( \frac{Z_1^\alpha }{\mathbb {E}(Z_1^\alpha )} , \frac{Z_2^\alpha }{\mathbb {E}(Z_2^\alpha )} \right) \right) \,, \end{aligned}$$

where \(Z_i\) are PH(\(\varvec{\pi }_i,\varvec{T}_i\)), and

$$\begin{aligned} \mathbb {E}(Z_i^\alpha ) = \Gamma (\alpha + 1) \varvec{\pi }_i(-{\varvec{T}}_i )^{-\alpha } \mathbf{e} \,, \quad i=1,2 \,. \end{aligned}$$

Moreover, \({Z_i}/{\mathbb {E}(Z_i^\alpha )^{1/\alpha }}\) is PH distributed with the same vector of initial probabilities \(\varvec{\pi }_i\) and sub-intensity matrix \(\tilde{\varvec{T}}_i = \varvec{T}_i \mathbb {E}(Z_i^\alpha )^{1/\alpha }\), \(i =1 ,2\). This implies that

$$\begin{aligned} \min \left( \frac{Z_1}{\mathbb {E}(Z_1^\alpha )^{1 / \alpha }} , \frac{Z_2}{\mathbb {E}(Z_2^\alpha )^{1 / \alpha }} \right) \sim \text{ PH } (\varvec{\pi }_1 \otimes \varvec{\pi }_2, \tilde{\varvec{T}}_1\oplus \tilde{\varvec{T}}_2) \,, \end{aligned}$$

which now yields

$$\begin{aligned} \lambda _U (\mathbf{Y} )&= \Gamma (\alpha + 1) (\varvec{\pi }_1 \otimes \varvec{\pi }_2)(-\tilde{\varvec{T}}_1\oplus \tilde{\varvec{T}}_2 )^{-\alpha } \mathbf{e} \,. \end{aligned}$$

Note that the resulting explicit expression for \(\lambda _U\) is in terms of the parameters of the PH components. For instance, when considering \(\Theta \sim \text{ Gamma }(\alpha ,1)\), the survival copula of the model can be different from the Clayton copula, for which \(\lambda _U = 2^{-\alpha }\). In Figure 1, we take the same value \(\alpha = 1\) and plot the implicit copula of two multivariate CPH distributions, one with upper tail dependence coefficient smaller than \(2^{-1}\) and the other larger than \(2^{-1}\), achieved solely by changing the parameters of the PH components.
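Proposition 4.1 gives \(\lambda _U\) in closed form, which is straightforward to implement. In the exchangeable unit-exponential case with \(\alpha = 1\), the formula reduces to \(\mathbb {E}(\min (Z_1, Z_2)) = 1/2\), which serves as a sanity check. A sketch:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power
from scipy.special import gamma as G

def frac_moment(pi, T, a):
    """E[Z^a] = Gamma(a+1) * pi (-T)^{-a} e for Z ~ PH(pi, T)."""
    return float(np.real(G(a + 1) * pi
                         @ fractional_matrix_power(-T, -a) @ np.ones(len(pi))))

def lambda_U(pi1, T1, pi2, T2, alpha):
    """Upper tail dependence: Gamma(alpha+1) (pi1 x pi2)(-(T1~ (+) T2~))^{-alpha} e."""
    T1t = T1 * frac_moment(pi1, T1, alpha) ** (1 / alpha)
    T2t = T2 * frac_moment(pi2, T2, alpha) ** (1 / alpha)
    A = np.kron(T1t, np.eye(len(T2t))) + np.kron(np.eye(len(T1t)), T2t)  # Kronecker sum
    pi = np.kron(pi1, pi2)
    return float(np.real(G(alpha + 1) * pi
                         @ fractional_matrix_power(-A, -alpha) @ np.ones(len(pi))))

# Sanity check: two independent unit exponentials, alpha = 1, gives 1/2.
pi1 = pi2 = np.array([1.0]); T1 = T2 = np.array([[-1.0]])
print(lambda_U(pi1, T1, pi2, T2, 1.0))
```

Changing the PH parameters while keeping \(\alpha\) fixed moves \(\lambda _U\) away from the Clayton value \(2^{-\alpha }\), which is precisely the flexibility illustrated in Figure 1.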

Fig. 1

Simulation of implicit copulas of multivariate SIPH with \(\lambda _U = 0.4128\) (left), and multivariate SIPH with \(\lambda _U = 0.5659\) (right)

4.4 Parameter estimation: multivariate SIPH distributions

Assume that \(\lambda _i(\,\cdot \, ; \mathbf \eta _i)\) is a parametric function depending on the vector \(\mathbf \eta _i\), \(i =1 ,\dots , d\), and let \(\mathbf \eta = (\mathbf \eta _1, \dots , \mathbf \eta _d)\). Then we can use that \((h^{-1}_1(Y_1 ; \mathbf \eta _1), \dots , h^{-1}_d(Y_d;\mathbf \eta _d))^{\top } {\mathop {=}\limits ^{d}}(Z_1/\Theta , \dots ,Z_d/\Theta )^{\top }\) to formulate a generalized EM algorithm for maximum-likelihood estimation, which generalizes Algorithm 2 to the multivariate case.

[Algorithm 5]

5 Correlated scaling

We now extend the scaling of the sub-intensity matrix of SIPH distributions to the case where we condition on a random vector, the scaling factors being the components of such a vector. We consider first the conditionally PH case, i.e., when no deterministic time-transform is present. Take a scaling vector \(\mathbf \Theta = (\Theta _1, \dots ,\Theta _d)^{\top }\) and \(\mathbf{Y} = (Y_1, \dots , Y_d)^{\top }\) such that the random variables \(Y_i\) are conditionally independent given \(\mathbf \Theta\) with laws

$$\begin{aligned} Y_i \mid \mathbf \Theta = (\theta _1, \dots ,\theta _d)^{\top } \sim \text{ PH }(\varvec{\pi }_i, \theta _i\varvec{T}_i)\,, \quad i = 1, \dots , d \,. \end{aligned}$$

Then, in full generality, the joint survival function of \(\mathbf{Y}\) is given by

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} ) = \int \prod _{i = 1}^{d} \varvec{\pi }_i \exp \left( {\theta _i \varvec{T}_i y_i}\right) \mathbf{e} dF_{\mathbf \Theta }(\varvec{\theta }),\quad y_i\ge 0,\,\,i=1,\dots ,d. \end{aligned}$$

Consider the bivariate case. Then, using functional calculus, we have that the joint survival function takes the explicit form

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} )&= \int \varvec{\pi }_1 \exp \left( {\theta _1 \varvec{T}_1 y_1}\right) \mathbf{e} \varvec{\pi }_2 \exp \left( {\theta _2 \varvec{T}_2 y_2}\right) \mathbf{e} dF_{\mathbf \Theta }(\varvec{\theta }) \\&= \int (\varvec{\pi }_1 \otimes \varvec{\pi }_2) \exp \left( {\theta _1 \varvec{T}_1 y_1 \oplus \theta _2 \varvec{T}_2 y_2}\right) \mathbf{e} dF_{\mathbf \Theta }(\varvec{\theta }) \\&= \int (\varvec{\pi }_1 \otimes \varvec{\pi }_2) \exp \left( {\theta _1 \varvec{T}_1 y_1 \otimes \mathbf{I} _2 + \mathbf{I} _1 \otimes \theta _2 \varvec{T}_2 y_2}\right) \mathbf{e} dF_{\mathbf \Theta }(\varvec{\theta }) \\&= (\varvec{\pi }_1 \otimes \varvec{\pi }_2) \mathcal {L}_{\varvec{\Theta }}(-\varvec{T}_1 y_1 \otimes \mathbf{I} _2 , -\mathbf{I} _1 \otimes \varvec{T}_2 y_2) \mathbf{e} , \quad y_1,y_2\ge 0, \end{aligned}$$

where \(\mathcal {L}_{\mathbf \Theta }\) is the joint Laplace transform of \(\mathbf \Theta\), that is

$$\begin{aligned} \mathcal {L}_{\mathbf \Theta } (u_1,u_2 ) = \mathbb {E}\left( \exp \left( {-u_1 \Theta _1 - u_2 \Theta _2 }\right) \right) ,\quad u_1,\,u_2\ge 0. \end{aligned}$$

Note that \(\mathbf{Y} = (Z_1 /\Theta _1, \dots , Z_d /\Theta _d)^{\top }\), where \(Z_i\) are independent \(\text{ PH }(\varvec{\pi }_i, \varvec{T}_i)\) distributed random variables, \(i = 1, \dots , d\), independent of \(\varvec{\Theta }\). Indeed,

$$\begin{aligned} {\mathbb P}\left( Y_1> y_1, \dots , Y_d> y_d \right)&= \int {\mathbb P}\left( Y_1> y_1, \dots , Y_d> y_d \mid \varvec{\Theta } = \varvec{\theta }\right) dF_{\mathbf \Theta }(\varvec{\theta }) \\&= \int {\mathbb P}\left( Z_1> \theta _1 y_1, \dots , Z_d > \theta _d y_d \mid \varvec{\Theta } = \varvec{\theta }\right) dF_{\mathbf \Theta }(\varvec{\theta }) \\&= \int \prod _{i = 1}^{d} \varvec{\pi }_i \exp \left( {\theta _i \varvec{T}_i y_i}\right) \mathbf{e} dF_{\mathbf \Theta }(\varvec{\theta }) \\&= S_\mathbf{Y }(\mathbf{y} ) \,. \end{aligned}$$

5.1 Parameter estimation: correlated CPH distributions

The maximum-likelihood estimation of this class of multivariate distributions can be performed via a generalized EM algorithm. The derivation is done similarly to Algorithm 4 and thus omitted for brevity. Again, for estimation, we assume that \(\varvec{\Theta }\) belongs to a parametric family depending on the vector \(\varvec{\alpha }\) and denote by \(f_{\varvec{\Theta }}\) its corresponding joint density. The resulting detailed routine is provided in Algorithm 6.

[Algorithm 6]

Remark 5.1

This algorithm suffers from the curse of dimensionality. The integrals above must typically be computed numerically, given that explicit expressions are not available, and the number of evaluation points needed for the approximation increases rapidly with the dimension. It is also important to mention that correlated frailty models are typically employed only in the bivariate case, in which the above algorithm is computationally feasible, hence its relevance.

5.2 Correlated SIPH distributions

We now introduce an analogue of the correlated frailty model based on IPH distributions, effectively the most general of our models. Consider a multivariate random scaling component \(\mathbf \Theta = (\Theta _1, \dots ,\Theta _d)^{\top }\) and \(\mathbf{Y} = (Y_1, \dots ,Y_d)^{\top }\), both in \(\mathbb {R}_+^d\), such that the \(Y_i\) are conditionally independent given \(\mathbf \Theta\) with conditional distribution

$$\begin{aligned} Y_i \mid \mathbf \Theta = (\theta _1, \dots , \theta _d)^{\top } \sim \text{ IPH }(\varvec{\pi }_i, \varvec{T}_i, \theta _i \lambda _i)\,, \quad i = 1, \dots , d \, . \end{aligned}$$

The joint survival function of \(\mathbf{Y}\) is then given by

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} ) = \int \prod _{i = 1}^{d} \varvec{\pi }_i \exp \left( {\theta _i \varvec{T}_i h_{i}^{-1}(y_i)}\right) \mathbf{e} dF_{\mathbf \Theta }(\varvec{\theta }), \quad y_i\ge 0,\,\,i=1,\dots ,d. \end{aligned}$$

In the bivariate case, we have by simple calculations (using functional calculus) the explicit expression

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} )&= (\varvec{\pi }_1 \otimes \varvec{\pi }_2) \mathcal {L}_{\varvec{\Theta }}(-\varvec{T}_1 h_1^{-1}(y_1) \otimes \mathbf{I} _2 , -\mathbf{I} _1 \otimes \varvec{T}_2 h_2^{-1}(y_2) ) \mathbf{e} , \quad y_1,\,y_2\ge 0. \end{aligned}$$

Note that an alternative representation for \(\mathbf{Y}\) is \(\mathbf{Y} = ( h_1 (Z_1 / \Theta _1), \dots , h_d (Z_d / \Theta _d))^{\top }\), where \(Z_i\) are independent PH distributed random variables independent of \(\varvec{\Theta }\). The proof is akin to those of previous sections.

Now we consider a specific example with explicit joint density, namely the correlated Gamma case.

Example 5.1

(Correlated Gamma scaling). Inspired by Yashin et al. (1995), we consider \(\mathbf \Theta = (\Theta _1, \Theta _2)^{\top }\) such that

$$\begin{aligned}&\Theta _1 = \frac{\eta _0}{\eta _1} W_0 + W_1 \\&\Theta _2 = \frac{\eta _0}{\eta _2} W_0 + W_2 \,, \end{aligned}$$

where \(W_i \sim \text{ Gamma }(\kappa _i, \eta _i)\), \(\kappa _i,\eta _i>0\), \(i = 0,1,2\), are independent. Then we have that

$$\begin{aligned}&\mathbb {E}\left( \exp (- u_1 \Theta _1 -u_2 \Theta _2 )\right) \\&\quad = \mathbb {E}\left( \exp \left( - \left( u_1 \frac{\eta _0}{\eta _1} + u_2 \frac{\eta _0}{\eta _2} \right) W_0 -u_1 W_1 -u_2 W_2 \right) \right) \\&\quad = \left( 1 + \left( \frac{u_1}{\eta _1} + \frac{u_2}{\eta _2}\right) \right) ^{-\kappa _0} \left( 1 + \frac{u_1}{\eta _1} \right) ^{-\kappa _1} \left( 1 + \frac{u_2}{\eta _2} \right) ^{-\kappa _2}, \,\, u_1,\,u_2\ge 0. \end{aligned}$$

This yields

$$\begin{aligned} S_\mathbf{Y } (y_1, y_2)&= (\varvec{\pi }_1 \otimes \varvec{\pi }_2 ) \left( \mathbf{I} -\left( \frac{h^{-1}_1(y_1)}{\eta _1} \varvec{T}_1\right) \oplus \left( \frac{h^{-1}_2(y_2)}{\eta _2} \varvec{T}_2\right) \right) ^{-\kappa _0} \\&\quad \cdot \left( \mathbf{I} - \left( \frac{h^{-1}_1(y_1)}{\eta _1}\varvec{T}_1\right) \otimes \mathbf{I} _2\right) ^{-\kappa _1} \left( \mathbf{I} - \mathbf{I} _1 \otimes \left( \frac{h^{-1}_2(y_2)}{\eta _2} \varvec{T}_2\right) \right) ^{-\kappa _2} \mathbf{e} \,. \end{aligned}$$
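The survival function above involves only Kronecker operations and fractional matrix powers, so it can be evaluated directly. A sketch (our own helper, using `scipy`'s `fractional_matrix_power` for the functional calculus):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def corr_gamma_siph_survival(pi1, T1, pi2, T2, hinv1, hinv2,
                             eta1, eta2, k0, k1, k2, y1, y2):
    """S_Y(y1, y2) for correlated Gamma scaling, via fractional matrix powers."""
    p1, p2 = len(pi1), len(pi2)
    I1, I2, I = np.eye(p1), np.eye(p2), np.eye(p1 * p2)
    A = (hinv1(y1) / eta1) * T1
    B = (hinv2(y2) / eta2) * T2
    ksum = np.kron(A, I2) + np.kron(I1, B)          # Kronecker sum A ⊕ B
    M = (fractional_matrix_power(I - ksum, -k0)
         @ fractional_matrix_power(I - np.kron(A, I2), -k1)
         @ fractional_matrix_power(I - np.kron(I1, B), -k2))
    return float(np.real(np.kron(pi1, pi2) @ M @ np.ones(p1 * p2)))
```

For one-phase (exponential) marginals the formula collapses to the product of scalar Gamma Laplace transforms, which gives a convenient correctness check.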

One typically sets \(\eta _1 = \kappa _0 + \kappa _1\) and \(\eta _2 = \kappa _0 + \kappa _2\). In this way \(\mathbb {E}(\Theta _1) = \mathbb {E}(\Theta _2) = 1\), \(\text{ Var }(\Theta _1) = \eta _1^{-1}\), \(\text{ Var }(\Theta _2) = \eta _2^{-1}\) and \(\text{ Corr }(\Theta _1, \Theta _2) = \kappa _0 / \sqrt{(\kappa _0 + \kappa _1 )(\kappa _0 + \kappa _2)}\).
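The moment identities above are easy to verify by simulation. A minimal sketch (our own function name), using the unit-mean normalization \(\eta _1 = \kappa _0 + \kappa _1\), \(\eta _2 = \kappa _0 + \kappa _2\):

```python
import numpy as np

def sample_correlated_gamma(k0, k1, k2, size, rng):
    """Theta_i = (eta_0/eta_i) W_0 + W_i with W_i ~ Gamma(k_i, eta_i) independent
    and eta_i = k0 + k_i, so that E(Theta_i) = 1."""
    eta0 = 1.0                          # eta_0 cancels in the construction; any value works
    eta1, eta2 = k0 + k1, k0 + k2
    W0 = rng.gamma(k0, 1.0 / eta0, size)       # numpy's gamma takes (shape, scale)
    th1 = (eta0 / eta1) * W0 + rng.gamma(k1, 1.0 / eta1, size)
    th2 = (eta0 / eta2) * W0 + rng.gamma(k2, 1.0 / eta2, size)
    return th1, th2
```

Empirical means, variances, and the correlation of a large sample should then match \(1\), \(\eta _i^{-1}\), and \(\kappa _0 / \sqrt{(\kappa _0 + \kappa _1)(\kappa _0 + \kappa _2)}\), respectively.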

Remark 5.2

(Estimation). Maximum-likelihood estimation can be performed via a modified EM algorithm, which takes the same form as Algorithm 5, the only change being in step 1, where we now employ Algorithm 4. We omit further details.

Remark 5.3

(Correlated frailty). The correlated frailty model assumes that the frailties of individuals are correlated but not necessarily shared. More specifically, in a bivariate correlated frailty model, the conditional joint survival function of \(\mathbf{Y} \mid \varvec{\Theta }= \varvec{\theta }\) is

$$\begin{aligned} S_\mathbf{Y | \varvec{\Theta }}(\mathbf{y} | \varvec{\theta }) = \exp (-\theta _1 M_1(y_1)) \exp (-\theta _2 M_2(y_2)) \,. \end{aligned}$$

In this way, the joint survival function of \(\mathbf{Y}\) is given by

$$\begin{aligned} S_\mathbf{Y }(\mathbf{y} ) = \mathcal {L}_{\varvec{\Theta }} (M_1(y_1), M_2(y_2)) \,. \end{aligned}$$

This is indeed a particular case of the correlated intensities model introduced in the present section when \(p = 1\).

6 Numerical illustrations

In this section, we present some numerical illustrations of practical relevance. In the first example, we test the performance of Algorithm 3 for the estimation of matrix Mittag-Leffler distributions in a simulation study. In the second example, we consider the fitting of a SIPH distribution to a given theoretical distribution. In the third example, we fit a SIPH to a real-life insurance data set. As a final example, we perform a simulation study for a multivariate CPH distribution. In all cases, we ran the generalized EM algorithms until the changes in successive log-likelihoods became negligible.

6.1 Matrix Mittag-Leffler distributions

We generated an i.i.d. sample of size 1,000 from a matrix Mittag-Leffler distribution of 4 phases with parameters

$$\begin{aligned}\begin{gathered} {\varvec{\pi }}=\left( 0.2, \,0.8,\, 0,\, 0\right) \,, \\ \mathbf{T}=\left( \begin{array}{cccc} -2 &{} 0 &{} 2 &{} 0 \\ 5 &{} -8 &{} 0 &{} 3 \\ 0 &{} 0 &{} -1 &{} 0.5 \\ 0 &{} 0 &{} 0 &{} -4 \end{array} \right) \,, \\ \alpha = 0.8 \,. \end{gathered}\end{aligned}$$

We then fitted a matrix Mittag-Leffler distribution with the same number of phases to the resulting sample using Algorithm 3, obtaining the following parameters:

$$\begin{aligned}\begin{gathered} \hat{\varvec{\pi }}=\left( 0, \,0.0381,\, 0.8481, \, 0.1139 \right) \,, \\ \hat{\varvec{T}}=\left( \begin{array}{cccc} -3.4286 &{} 0.1942 &{} 0.0495 &{} 0.5393\\ 0.6080 &{} -1.2013 &{} 0.0184 &{} 0.0084 \\ 2.4001 &{} 2.1178 &{} -4.7794 &{} 0.2615 \\ 0.3800 &{} 0.2744 &{} 0.3870 &{} -1.0648 \end{array} \right) \,, \\ \hat{\alpha }= 0.7928\,. \end{gathered}\end{aligned}$$

Observe that the parameters can be partially retrieved once possible permutations of the states are taken into account (their labels are not relevant). Figure 2 shows that the algorithm recovers the structure of the data. Moreover, note that \(\hat{\alpha }= 0.7928\), which determines the heaviness of the tail, is close to the original value \(\alpha = 0.8\). As further evidence of the quality of the fit, the log-likelihood of the fitted model is \(-1,769.596\), while the original distribution parameters and structure yield \(-1,773.453\).

Fig. 2

Histogram of log-simulated data versus density of the fitted matrix Mittag-Leffler model (left), and corresponding QQ-plot (right)

6.2 Matrix-Weibull

Algorithm 2 can be easily modified to approximate given theoretical distributions. As in the PH case (Asmussen et al. (1996)), the idea consists of considering sequences of empirical distributions with increasing sample size. For instance, if we denote by g the theoretical given density that we want to approximate, in step 1, we have that as \(K\rightarrow \infty\),

$$\begin{aligned} \hat{\pi }_k&= \frac{1}{K} \sum _{n = 1}^{K} \int \frac{\pi _k \mathbf{e} _k^\top \exp ({\theta \varvec{T}h^{-1}(y_n)}) \theta \mathbf{t} }{\varvec{\pi }\exp ({\theta \varvec{T}h^{-1}(y_n)}) \theta \mathbf{t} } f_\Theta (\theta ) d\theta \\&\rightarrow \int \int \frac{\pi _k \mathbf{e} _k^\top \exp ({\theta \varvec{T}h^{-1}(y)}) \theta \mathbf{t} }{\varvec{\pi }\exp ({\theta \varvec{T}h^{-1}(y)})\theta \mathbf{t} } f_\Theta (\theta ) d\theta g(y) dy \,. \end{aligned}$$

The rest of the formulas in step 1 are adapted through the same limit. Regarding step 2, we have

$$\begin{aligned} \hat{\varvec{\eta }}&\rightarrow \mathop {\mathrm {arg\,max}}\limits _{\varvec{\eta }} \int \log (f_{Y}(y; \hat{\varvec{\pi }}, \hat{\varvec{T}}, \hat{\varvec{\alpha }}, \varvec{\eta })) g(y)\,dy \,. \end{aligned}$$
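Both limits replace an empirical average over the sample by an integral against g, which in practice is computed by quadrature. A small helper illustrating the substitution (the truncation point and node count are our own choices, not prescribed by the algorithm):

```python
import numpy as np

def density_average(phi, g, upper, n_nodes=200):
    """Approximate the integral of phi(y) g(y) over [0, upper] by Gauss-Legendre
    quadrature. This replaces the sample average (1/K) sum_n phi(y_n) in the
    E-step when a theoretical density g is approximated instead of data; `upper`
    should cover the effective support of g."""
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    y = 0.5 * upper * (nodes + 1.0)        # map [-1, 1] -> [0, upper]
    w = 0.5 * upper * weights
    return np.sum(w * phi(y) * g(y))
```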

As a concrete example, we consider a matrix-Weibull distribution (as introduced in Albrecher and Bladt (2019), which has no random scaling component) with density function

$$\begin{aligned} g(y) = \varvec{\pi }\exp (\mathbf{S} y^{\beta }) \mathbf{s} \beta y^{\beta -1} \,, \quad y>0 \,, \end{aligned}$$

and parameters

$$\begin{aligned}\begin{gathered} {\varvec{\pi }}=\left( 0.5, \,0.3,\, 0.2 \right) \,, \\ \mathbf{S}=\left( \begin{array}{ccc} -1 &{} 1 &{} 0 \\ 0 &{} -2 &{} 1 \\ 0 &{} 0 &{} -5 \end{array} \right) \,, \\ {\beta }= 2 \,. \end{gathered}\end{aligned}$$
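The density g is straightforward to evaluate with a matrix exponential; the snippet below (our own helper name) also verifies numerically that it integrates to one for the parameters above:

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad

def matrix_weibull_density(pi, S, beta, y):
    """g(y) = pi exp(S y^beta) s beta y^(beta-1), with exit vector s = -S e."""
    s = -S @ np.ones(len(pi))
    return float(pi @ expm(S * y**beta) @ s * beta * y**(beta - 1))

pi = np.array([0.5, 0.3, 0.2])
S = np.array([[-1.0, 1.0, 0.0],
              [0.0, -2.0, 1.0],
              [0.0, 0.0, -5.0]])
total, _ = quad(lambda y: matrix_weibull_density(pi, S, 2.0, y), 0, np.inf)
```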

We then fitted a SIPH distribution of 3 phases with baseline intensity \(\lambda (y) = \eta y^{\eta -1}\), \(\eta >0\), and positive stable scaling. The fitted parameters are the following:

$$\begin{aligned}\begin{gathered} \hat{\varvec{\pi }}=\left( 0.1876, \,0.3037,\, 0.5086 \right) \,, \\ \hat{\varvec{T}}=\left( \begin{array}{ccc} -1.9843 &{} 1.2605 &{} 0.5706\\ 0.0133 &{} -1.2985 &{} 0.1584 \\ 2.3573 &{} 0.9338 &{} -5.2052 \end{array} \right) \,, \\ \hat{\alpha }= 0.9146\,, \quad \hat{\eta }= 2.1723 \,. \end{gathered}\end{aligned}$$

The quality of the approximation is supported by Figure 3, which shows that we recover the shape of the original distribution. Moreover, the product \(\hat{\alpha }\hat{\eta } = 1.9867\), which determines the heaviness of the tail, can be compared with \(\beta = 2\) for the given theoretical model.

Fig. 3

Density of the original matrix-Weibull versus density of the fitted SIPH (left), and corresponding QQ-plot (right)

6.3 Real-life data

The Gamma-Gompertz frailty model is commonly employed for modeling human mortality at old ages (see, e.g., Missov (2013); Vaupel et al. (1979)). In the present example, we propose using SIPH distributions with Gamma scaling and Gompertz baseline intensity for modeling this type of data.

As a concrete case study, we consider the lifetimes of the Swedish population that died in the year 2011 between ages 50 and 100. These data were obtained from the Human Mortality Database (HMD). We add covariate information by distinguishing between females (\(X=1\)) and males (\(X=0\)) in the population. We then fitted a SIPH distribution of 4 phases with general Coxian structure in the PH component. The estimated parameters are

$$\begin{aligned}\begin{gathered} \hat{\varvec{\pi }}=\left( 0.2097, \,0.1572,\, 0.3135,\,0.3196 \right) \,, \\ \hat{\varvec{T}}=\left( \begin{array}{cccc} -0.0022 &{} 0.0004 &{} 0 &{} 0 \\ 0 &{} -1.1003 &{} 1.1003 &{} 0 \\ 0 &{} 0 &{} -0.6730 &{} 0.6730\\ 0 &{} 0 &{} 0 &{} -0.0001 \end{array} \right) \,, \\ \hat{\alpha }= 5.803\,, \quad \hat{\eta }= 0.1663 \,, \quad \hat{\beta }= -0.5389 \,. \end{gathered}\end{aligned}$$

Figure 4 shows that the fitted distribution provides a reasonable model for both groups. If an even closer fit is sought, other parameters of the model need to be regressed as well.
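For reference, here is a sketch of how such a Gamma-Gompertz SIPH survival function could be evaluated. We stress that the parametrization used here (Gompertz inverse intensity \(h^{-1}(y)=(e^{\eta y}-1)/\eta\), proportional-hazards covariate effect \(e^{\beta x}\), and \(\Theta \sim \text{Gamma}(\alpha ,\alpha )\) scaling) is our own assumption for illustration and need not match the fitted model's exact specification:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def gamma_gompertz_siph_survival(pi, T, alpha, eta, beta, x, y):
    """Sketch of S(y | X = x) = pi (I - exp(beta x) h^{-1}(y) T / alpha)^(-alpha) e
    under the assumed parametrization (Gompertz baseline, Gamma(alpha, alpha) frailty)."""
    hinv = (np.exp(eta * y) - 1.0) / eta               # Gompertz inverse intensity
    M = fractional_matrix_power(
        np.eye(len(pi)) - np.exp(beta * x) * hinv * T / alpha, -alpha)
    return float(np.real(pi @ M @ np.ones(len(pi))))
```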

Fig. 4

Histogram of lifetimes of the Swedish female population that died in 2011 at ages 50 to 100 versus density of the fitted SIPH (left), and corresponding plot for the male population (right)

6.4 Multivariate example

We generated an i.i.d. sample of size 2,500 from a bivariate CPH distribution with parameters

$$\begin{aligned}\begin{gathered} {\varvec{\pi }_1}=\left( 1, \,0,\, 0 \right) \,, \\ {\varvec{T}_1}=\left( \begin{array}{ccc} -0.5 &{} 0.2 &{} 0 \\ 0 &{} -1 &{} 0.5 \\ 0 &{} 0 &{} -2 \end{array} \right) \,, \\ {\varvec{\pi }_2}=\left( 0.5, \,0.5 \right) \,, \\ {\varvec{T}_2}=\left( \begin{array}{cc} -0.1 &{} 0 \\ 0 &{} -1 \end{array} \right) \,, \end{gathered}\end{aligned}$$

and Gamma scaling with \(\alpha = 1.5\). Note that the upper tail dependence coefficient of the theoretical model is \(\lambda _U = 0.2765\), while the empirical estimator of the sample is \(\hat{\lambda }_U = 0.28\). We then fitted a bivariate CPH model of the same dimensions using Algorithm 4, obtaining the parameters
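Under Gamma scaling, the joint survival function of the bivariate CPH model again has a closed form via functional calculus. The sketch below assumes the shared scaling variable is \(\Theta \sim \text{Gamma}(\alpha ,\alpha )\) (unit mean), so that \(S(y_1,y_2)=(\varvec{\pi }_1\otimes \varvec{\pi }_2)(\mathbf{I} -(\varvec{T}_1y_1\oplus \varvec{T}_2y_2)/\alpha )^{-\alpha }\mathbf{e}\); the exact rate parametrization used in the simulation may differ:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def cph_gamma_survival(pi1, T1, pi2, T2, alpha, y1, y2):
    """S(y1, y2) = (pi1 kron pi2) (I - (T1 y1 ⊕ T2 y2)/alpha)^(-alpha) e,
    assuming a shared Gamma(alpha, alpha) scaling variable."""
    p1, p2 = len(pi1), len(pi2)
    ksum = np.kron(T1 * y1, np.eye(p2)) + np.kron(np.eye(p1), T2 * y2)
    M = fractional_matrix_power(np.eye(p1 * p2) - ksum / alpha, -alpha)
    return float(np.real(np.kron(pi1, pi2) @ M @ np.ones(p1 * p2)))
```

For one-phase (exponential) marginals this reduces to the scalar Gamma Laplace transform evaluated at \(y_1 + y_2\), which is an easy consistency check.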

$$\begin{aligned}\begin{gathered} \hat{\varvec{\pi }_1}=\left( 0.3268, \,0.2124,\, 0.4608\right) \,, \\ \hat{\varvec{T}_1}=\left( \begin{array}{ccc} -2.0252 &{} 1.0067 &{} 0.9015 \\ 0.0334 &{} -1.0061 &{} 0.3753 \\ 0.9293 &{} 0.6818 &{} -1.7945 \end{array} \right) \,, \\ \hat{\varvec{\pi }_2}=\left( 0.884, \,0.116 \right) \,, \\ \hat{\varvec{T}_2}=\left( \begin{array}{cc} -0.8978 &{} 0.3046 \\ 0.1501 &{} -0.1546 \end{array} \right) \,, \\ \hat{\alpha } = 1.5874 \,. \end{gathered}\end{aligned}$$

Figure 5 shows that we recover the structure of both marginals. Figure 6, where we offer some contour plots, supports that the dependence structure is also well captured. Moreover, note that the parameter \(\alpha\), which determines the heaviness of the tails of the marginals, is close to that of the original model, and that the coefficient of upper tail dependence \(\lambda _U = 0.254\) is close to the original (and sample) one. Finally, note that the original model's log-likelihood is \(-11,753.27\), compared with \(-11,752.45\) for the fitted model.

Fig. 5

Histograms of log-simulated data versus densities of the fitted distribution

Fig. 6

Contour plot of the sample (left), contour plot of original distribution (center), and contour plot of fitted distribution (right)

7 Conclusion

We have provided a phase-type-based framework which can produce non-exponential tail behavior by introducing random and deterministic transformations. The resulting models are generally tractable in terms of matrix calculus through the Laplace transform of the random component, and the resulting closed-form formulas allow for statistical and probabilistic treatment, for instance via fully explicit generalized EM algorithms. In the univariate case, the three main existing ways of generating heavy-tailed phase-type distributions fall into our framework, and several new models are introduced to complement the existing suite of hidden Markov models. In the multivariate case, we obtain generalizations of well-known frailty models with fully explicit densities, in contrast to other approaches to multivariate phase-type distributions in the literature (in terms of rewards or copulas). We finally showed the feasibility of the statistical implementation of our models in four different examples.

Heavy-tailed phase-type distributions are statistically attractive since their interpretation in terms of an underlying evolving process is natural in many domains of application involving processes that traverse numerous states through time, for instance human lifetimes or legal cases. With the models and algorithms provided in this paper, we aim to give a clearer picture of the possibilities and limitations of Markov models for practitioners who require non-standard but interpretable models. A promising further direction of research for generating uni- and multivariate scaled phase-type distributions is to consider a general stochastic process as time-change, which for certain choices may provide fully explicit functionals and estimation procedures while remaining conceptually simple.