Heavy-tailed phase-type distributions: a unified approach

A phase-type distribution is the distribution of the time until absorption in a finite state-space time-homogeneous Markov jump process, with one absorbing state and the rest being transient. These distributions are mathematically tractable and conceptually attractive to model physical phenomena due to their interpretation in terms of a hidden Markov structure. Three recent extensions of regular phase-type distributions give rise to models which allow for heavy tails: discrete- or continuous-scaling; fractional-time semi-Markov extensions; and inhomogeneous time-change of the underlying Markov process. In this paper, we present a unifying theory for heavy-tailed phase-type distributions for which all three approaches are particular cases. Our main objective is to provide useful models for heavy-tailed phase-type distributions, but any other tail behavior is also captured by our specification. We provide relevant new examples and also show how existing approaches are naturally embedded. Subsequently, two multivariate extensions are presented, inspired by the univariate construction which can be considered as a matrix version of a frailty model. We provide fully explicit EM-algorithms for all models and illustrate them using synthetic and real-life data.


Introduction
Phase-type (PH) distributions have been employed extensively in applied probability since they often provide exact and explicit solutions to complex stochastic problems. Another attractive property of PH distributions is that they form a dense class in the set of distributions in the positive half-line in the sense of weak convergence (see [Section 3.2.1] Bladt and Nielsen (2017)). However, and despite their denseness, PH distributions are always light-tailed, which may be a problem when heavy tails are present.
At least three approaches to remedy this problem have been introduced in the literature. The first one, originally introduced in Bladt et al. (2015) and called the NPH class of distributions, consists of considering PH distributions scaled by nonnegative discrete random variables, N. This construction principle has the advantage that the resulting distribution maintains the interpretation as being the absorption time of a homogeneous Markov jump process but in an infinite-dimensional state-space. This, indeed, allows for genuinely heavy tails for the resulting distribution. For instance, in Rojas-Nandayapa and Xie (2018), the authors showed that if the scaling component is unbounded (but otherwise arbitrary), then the resulting distribution is always heavy-tailed in terms of non-existent moment generating functions (see also Su and Chen (2006) for more general results). However, their different functionals are in terms of infinite-dimensional matrices, which in practice, can only be computed up to a finite number of terms. More recently, in Albrecher et al. (2021a), the authors considered continuous scaling and showed that closed-form expressions for different functionals of the resulting distributions can be obtained. They denoted this class by CPH. Another advantage of continuous scaling is that it maintains the (finite) dimension of the underlying PH.
A second approach was introduced in Albrecher et al. (2020a) by considering a time-fractional version of the underlying stochastic process dynamics, effectively moving into the semi-Markov domain. Together with subsequent multivariate extensions based on rewards (cf. Albrecher et al. ( , 2020b), these models were shown to be feasible models for applications such as non-life insurance modeling. More recently,  showed that these models are relevant in describing lifetimes and performing the corresponding life-insurance calculations.
The third approach, introduced in Albrecher and Bladt (2019), consists of allowing the Markov jump process to be time-inhomogeneous in the construction principle of PH distributions leading to the class of inhomogeneous phase-type (IPH) distributions. An advantage of this approach is that one gains substantial flexibility on the tails: not only are heavy tails possible but also, e.g., lighter tails than exponential-decay can be obtained. Further extensions to covariate-dependent distributions can be found in Albrecher et al. (2021b), which is particularly well-suited for survival analysis applications.
Estimation of PH distributions was initially developed to calibrate such stochastic models to real-life data, and it is a well-developed topic in the literature. It is typically done via an expectation-maximization (EM) algorithm (Asmussen et al. (1996)), although other methods such as an MCMC approach have been introduced (Bladt et al. (2003)). More recent trends have moved towards considering PH-based models purely as flexible models for statistical fitting, irrespectively of their explicit and closed-form formulas. This data-driven approach is particularly attractive compared to other classical alternatives (for instance, kernel smoothing) since there is the implicit interpretation of an underlying process traversing through different states before it terminates, which is easy to justify in many application areas. Algorithms for discretely-scaled PH distributions, IPH models, and continuously-scaled PH distributions can be found, respectively, in Bladt and Rojas-Nandayapa (2018), , and Albrecher et al. (2021a). To the best of the authors' knowledge, an EM-based estimation procedure for fractional phase-type distributions (also called matrix Mittag-Leffler distributions) has not been considered before the present work, with Albrecher et al. (2020a) performing a purely numerical multi-dimensional maximum-likelihood estimation.
The primary purpose of this paper is to present a unified theory that englobes the above approaches to produce heavy-tailed phase-type distributions. The construction principle of the proposed models is simple to conceptualize and can be seen as a matrix extension of the frailty model in survival analysis. However, the flexibility of the underlying Markov structure allows for very different objects to be constructed as special cases. More precisely, we study IPH distributions with intensity matrices scaled by any nonnegative random variable. In other words, we impose both a random and a deterministic component which modify the speed at which the finite state-space is traversed by the Markov process, such that absorption times can possess any desired tail and body behavior, in particular obtaining heavy-tailed distributions. Inhomogeneous generalizations of Albrecher et al. (2021a); Rojas-Nandayapa and Xie (2018), the matrix Mittag-Leffler models of Albrecher et al. (2020a), and randomly scaled generalizations of Albrecher and Bladt (2019); Albrecher et al. (2021b) (with the possibility of missing covariates) are all comprised in this rich class.
In terms of physical interpretation, the latent variables play different roles. The underlying Markov dynamics aim to model heterogeneity by assuming that unobserved traversing of states has occurred. In contrast, the interpretation of the scaling component is closely related to the statistical concept of frailty. Recall that frailty models (see, e.g., Wienke (2010) for a comprehensive account of such models) specify a multiplicative random effect on the hazard rate of a distribution, effectively accounting for unobserved covariates in a Cox proportional hazards model. In contrast, we specify a multiplicative random effect on the intensity function of a Markov jump process. Nonetheless, since for IPH distributions, the hazard rate and intensity function are asymptotically equivalent (cf. Albrecher et al. (2021b)), the scaling variable can also be interpreted as accounting for heterogeneity or missing covariates in an asymptotically proportional hazards model.
The secondary aim of the paper is to present multivariate models based on this construction, which can be interpreted as generalizations of the shared and correlated frailty models (cf. Wienke (2010)). We derive EM algorithms for maximumlikelihood estimation of all the proposed models, which can be implemented either in full generality or by simplifying some assumptions and tailoring the methods for the specific application. For pedagogical reasons, we build up the multivariate case from the univariate one, although a top-bottom approach is also possible.
The rest of the paper is organized as follows. In Section 2, we present an overview of the class of IPH distributions and some important properties for our present purposes. In Section 3, we introduce our main univariate model, which we call scaled inhomogeneous phase-type, derive its main properties, give several parametric examples relevant for real-life applications, and propose a generalized EM algorithm for its maximum-likelihood estimation. In Section 4, we present a multivariate extension inspired by the shared frailty model and show how estimation of the proposed models can be performed via EM algorithms. In Section 5, we present a different multivariate extension, now based on the construction principle of correlated frailty models, and derive an EM algorithm for maximum-likelihood estimation. In Section 6, we present several numerical illustrations. Finally, Section 7 concludes.

Preliminaries
This section presents the relevant preliminaries on time-inhomogeneous Markov jump-processes and their absorption times. The distributions of the latter times are the building blocks for the scaled models introduced in Section 3. For distributional equality between two random variables X, Y, we use the notation X d =Y , while the notation X ∼ F for F a distribution function, density, or acronym is understood as X following the distribution uniquely associated with F. Unless stated otherwise, equalities between random objects hold almost surely. For two real-valued functions, g, h the terminology g(t) ∼ h(t) , as t → a ∈ ℝ ∪ {−∞, +∞} is defined as lim t→a g(t)∕h(t) = 1 . If a is not explicitly mentioned, it is assumed to be +∞.
Let (X t ) t≥0 denote a time-inhomogeneous Markov jump process on the state-space E = {1, … , p, p + 1} , where states 1, … , p are transient and state p + 1 is absorbing. In this way, (X t ) t≥0 has an intensity matrix of the form Since (t) is an intensity matrix, the sum of its rows is zero for any time t ≥ 0 , and so the identity (t) = −T(t) , holds, where is the p-dimensional column vector of ones. Moreover, the probability transition matrix P(s, t) = {p k,l (s, t)} k,l∈E of (X t ) t≥0 , where is given in terms of the product integral (see Albrecher and Bladt (2019)) To avoid degeneracies, we assume that the process starts almost surely in a nonabsorbing state k ≤ p with probabilities given by k = ℙ(X 0 = k) , k = 1, … , p . In vector notation, we write = ( 1 , … , p ) . In the sequel, we follow the convention that greek boldface lowercase letters are row-vectors, while roman boldface lowercase letters are column-vectors. Thus ∑ p k=1 k = = 1. The main quantity of interest of such a process for our present purposes is the time taken to reach the absorbing state, denoted by which has an inhomogeneous phase-type distribution (cf. Albrecher and Bladt (2019)) with representation ( , T(t)) , and we write ∼ IPH ( , T(t)) . Application of such random variables to statistical modeling is often treated for the special case T(t) = (t) T , with (t) some known nonnegative real function, known as the intensity function, and T a fixed sub-intensity matrix. We adopt this approach in the present text. Thus we may simply write ∼ IPH ( , T, ) . The interested reader is referred to Bladt and Nielsen (2017) for a comprehensive account of the ≡ 1 case and Albrecher and Bladt (2019) for further reading on general IPH distributions.
The restricted class of IPH distributions is nonetheless quite versatile. Whenever Y ∼ IPH ( , T, ) , then there exists a function h such that where Z ∼ PH ( , T) . More specifically, the relationship between h and is given by or in terms of derivatives To make sure that Y is positive, unbounded, and almost surely finite, we require that The density f Y and survival function S Y of Y ∼ IPH ( , T, ) are explicit in terms of matrix exponential formulas, and given by The tail behavior of IPH distributions is driven by the asymptotic behavior of the function. Table 1 presents an overview of some commonly used intensities and transforms for generating parametric IPH distributions (see Bladt and Yslas (2021)). Applications and estimation can be found, for instance, in Albrecher and Bladt (2019); Albrecher et al. ( , 2021b. Their names are inspired by the p = 1 case, e.g., a matrix-Weibull distribution reduces to the regular Weibull distribution when T is a 1 × 1 matrix. In general, the additional parameters allow for more flexible modeling in the body of the distribution while preserving the same tail behavior as the scalar case.

Scaled inhomogeneous phase-type distributions
In this section, we introduce the main general specification of the paper and then derive some special cases together with a detailed analysis of their specific tail asymptotics. The central assumption underpinning our model is that an individual's intensity function depends on an unobservable nonnegative random variable Θ . More specifically, we focus on the case where Θ acts multiplicatively on the intensity function, that is where is the baseline intensity function. If we denote by Y a random variable with intensity (2), then we have that For the representation of these distributions, we make use of functional calculus. More specifically, if g is an analytic function and A is a matrix, we can express g(A) by Cauchy's formula where Γ is the simple closed path in ℂ which encloses the eigenvalues of A (cf. [Section 3.4.] Bladt and Nielsen (2017) for details).
The following result characterizes the density and survival functions of Y. In particular, observe that the asymptotic behavior of the tail of Y depends on both the shape of L Θ , the Laplace transform of Θ , and on . In subsection 3.1, we give an in-depth asymptotic analysis of the new parametric models presented in this paper.
(2) (t;Θ) = Θ (t), t ≥ 0, Proposition 3.1 Let Y be given by (3). Then we have that, for y ≥ 0 , where we have used functional calculus to define the Laplace transform evaluated at a matrix. Taking derivatives in the expression above yields from which (2) follows.
The following lemma shows that Y has the same distribution as the transformation of a scaled PH distribution. Such a representation is useful for simulation and for estimation, as is apparent in later sections.

Lemma 3.1 Let Y be given in terms of
Proof We now make the following formal definition of a random variable Y satisfying (3).

Definition 3.1 A random variable Y is said to have scaled inhomogeneous phasetype distribution (SIPH) with representation ( , T, ) and scaling distribution F Θ if its survival function is given by
We write SIPH ( , T, , Θ). where Z ∼ PH ( , T) and S is an independent (positive stable) random variable with Laplace transform given by exp(−u ) , ∈ (0, 1] . Hence, we have that Y is SIPH distributed with h(x) = x 1∕ and Θ = 1∕S . This class of distributions is the time-fractional counterpart of PH distributions and can be interpreted as absorption times of a stochastic process that traverses through a finite number of states. The holding times of the latter are Mittag-Leffler distributed, which are regularly varying, and thus can possess abnormally large holding times compared to a Markov framework. However, the boundary case = 1 corresponds to the usual exponential holding times, and thus there is a regime-shift with respect to tail behavior. iii) When the scaling component Θ degenerates to a point Θ ≡ k ∈ ℝ + , we recover the class of IPH distributions. This also implies that the class of SIPH distributions, with a given and fixed intensity, is dense in the class of distributions on the positive real line. The argument is omitted, but it is a simple application of convergence through the diagonal of an array, for instance, by choosing a sequence of scalings Θ n with constant mean k and variances shrinking to zero.

Remark 3.2
Recall that for a continuous and positive random variable Y, the hazard function Y is given by Sometimes, it is convenient to deal with the cumulative or integrated hazard function M Y , which is given by The classical frailty model in survival analysis assumes that the hazard function of an individual depends on an unobservable random variable Θ . More specifically, it assumes that Θ acts multiplicatively on a baseline hazard function , that is Here, the random variable Θ is known as the frailty. If we denote by Y the random variable with the above hazard, then the survival function of Y | Θ = is given by Thus, the unconditional survival function of Y is given by Furthermore, model (4) can incorporate covariates = (X 1 , … , X q ) ⊤ ∈ ℝ q in a similar way to the Cox's proportional hazards model via where ∈ ℝ q is a q-dimensional parameter row vector. Note that when the frailty degenerates to Θ ≡ 1 , one recovers the proportional hazards model, meaning that the frailty model generalizes the proportional hazards model. Commonly employed distributions as frailties include the Gamma and the positive stable distributions, among others.
In Albrecher et al. (2021b), it was shown that the intensity function of an IPH distribution is asymptotically equivalent to its hazard function. More specifically, we have that (t) ∼ C (t) as t → ∞ with C > 0, a positive constant. In particular, when p = 1 , the previous asymptotic result becomes equality. It follows that the frailty model is a special case of our more general matrix specification of SIPH distributions, when p = 1.

Remark 3.3 (Incorporating regressors). As in the frailty model, we can introduce covariates into (2) via
In this case, we write Y ∼ SIPH ( , T, , Θ, ) to denote a random variable with above intensity. Note that the proportional intensities model introduced in Albrecher et al. (2021b) is retrieved if the scaling distribution degenerates to Θ ≡ 1 for all individuals. Consequently, the SIPH model is a generalization of the proportional intensities model.
In what follows, we mostly restrict ourselves to the model (2) without covariates, the extension being straightforward but somewhat distracting to the current train of thought. Moreover, we assume that Θ is a continuous random variable unless stated otherwise.

Novel examples
Next, we present a suite of new examples that arise naturally as matrix extensions of some well-known frailty models, providing along the way some insight into the precise asymptotic behavior of the proposed models. In Appendix 1, the definitions of the different classes of heavy-tailed distributions are provided. Regarding the asymptotic behavior, we have that where C is a positive constant, which follows from an eigenvalue decomposition of T . The first-order precise asymptotics for the different intensities from Table 1 are  provided in Table 2, where D, b, and c denote positive real-valued constants, which may change between intensities, but we write the same symbol for display purposes. Throughout the rest of this section, we use the same notational convention. As a particular case, take (y) = y −1 , > 0 . Then It was noted in Albrecher et al. (2021a) that ( , −(−T) ) is a PH representation. Thus, some simple calculations show that these distributions span the same class as the matrix-Weibull laws introduced in Albrecher and Bladt (2019). This is in contrast to the class of CPH distributions with stable mixing in Albrecher et al. (2021a), which only span the matrix-Weibull laws with ∈ (0, 1).
Regarding their asymptotic behavior, we have Table 3 gives the precise asymptotics for the different intensities of Table 1.
Example 3.6 (Inverse Gaussian scaling). Consider inverse Gaussian scaling with parameters > 0 and > 0 and density Then, the corresponding Laplace transform of Θ is given by We take the particular case = 1 and 2 = 1∕ . In this way Thus, Regarding the asymptotic behavior, we have that Table 4 gives the precise asymptotics for the different intensities of Table 1. where > 0 , > 0 and 0 < ≤ 1. This family includes the Gamma, inverse Gaussian and the positive stable distributions as particular cases. Here we assume that = 1 , which results in Regarding the asymptotic behavior, we have that which results in the same asymptotics of Table 3 for the positive stable case, but with replaced by .

Example 3.8 (Compound Poisson scaling). Consider a compound model
In general, the Laplace transform of Θ is given by In particular, for V ∼ Gamma ( , 1) and N ∼ Poisson ( ) , we obtain Note that this distribution has an atom at infinity with probability exp(− ) , corresponding to the probability of ℙ(N = 0) . In survival analysis terms, this means that an individual may never experience the event of interest with such probability. Considering N + 1 instead of N removes such an atom.
Example 3.9 (Discrete scaling). Assume that Θ is a discrete random variable taking values in { 1 , 2 , … } ⊂ ℝ + with corresponding probabilities = ( 1 , 2 , … ) , that is, ℙ(Θ = i ) = i , i = 1, 2, … . Then, Define the linear transformation T on ℝ ℕ given by Then, we can rewrite the survival function of Y as where ⊗ denotes the Kronecker product, and ̃ is a column vector of ones of appropriate dimension. This can be thought of as an infinite-dimensional IPH distribution. The case ≡ 1 recovers the class of NPH distributions introduced in Bladt et al. (2015).
Note that another approach to study the asymptotic behavior, and that is particularly convenient in the discrete scaling case, is to use the representation Y = h(Z∕Θ) , so that and employ the asymptotics of Z∕Θ . For instance, taking Θ ∼ Gamma ( , 1) , we have that Z∕Θ is regularly varying with index (see Albrecher et al. (2021a) for details). This leads to the same asymptotic results in Table 2 for the different choices of intensities . For the discrete scaling, we could take, for instance, Θ with Zeta distribution leading to the same asymptotic results.
As a second case, take V ∶= 1∕Θ with Weibull-type tail so that VZ has Weibulltype tail with shape parameter in (0, 1) (see Rojas-Nandayapa and Xie (2018)). Thus, the asymptotic behavior for the different intensities resemble those in Table 3.
Example 3.10 (Missing covariates in the proportional intensities model). Consider the proportional intensities model (also known as PH regression) introduced in Albrecher et al. (2021b) with vectors of observed and unobserved covariates 1 and 2 , respectively. Namely, the intensity is of the form Given that the vector 2 is unknown, the model cannot be employed in practice. However, we can assume that is an unobserved random variable independent of 1 . In this way, the scaled intensity model can be employed to account for the effect of omitted covariates by considering a parametric model for Θ . Such additional random component can thus help account for additional variability observed in data that cannot be explained by a simpler model.

Parameter estimation
In order to derive an EM algorithm for SIPH distribution, we first recall the corresponding algorithm for CPH distributions in Albrecher et al. (2021a) (see Bladt and Rojas-Nandayapa (2018) for the discrete scaling case). Consider y 1 , … , y K an i.i.d. sample from a CPH distributed random variable Y, which we will also denote by . Here, we assume that the scaling component Θ belongs to a parametric family depending on the parameter vector and denote by f Θ its corresponding density. We now make the following definitions. Let B k be the number of times the underlying Markov jump process of Y starts in state k, N kl the total number of transitions from state k to l until absorption, N k the number of times that k was the last state to be visited before absorption, and finally, let Z k be the cumulated time that the Markov jump process spent in state k. The detailed routine for estimation of CPH distributions is given in Algorithm 1.

Proposition 3.2 Algorithm 2 increases the likelihood function at each iteration. Since for fixed p, the likelihood of SIPH distributions is bounded, convergence towards a (possibly local) maximum is guaranteed.
Proof By the change of variable theorem, we have that Consider parameter values ( i , T i , i , i , i ) after the i-th iteration. Then the data log-likelihood after the i-th iteration is given by In the (i + 1)-th iteration, we first obtain ( i+1 , T i+1 , i+1 ) in 1. so that Finally, by 2.

Remark 3.4 The optimization problem
log(f Y (y n ;̂ ,T,̂ , , )) of Algorithm 2 is computationally heavy. However, observe that fewer iterations of any optimization routine are sufficient for the proof and conclusion of Proposition 3.2 to hold, and full convergence of (5) is not necessary. For instance, one step of the arg max routine can already provide good results.
Remark 3.5 (Incorporating right-censoring). Algorithm 2 can be modified to work with censored data. We illustrate the changes by considering only the case of right-censoring since it is the most common scenario in survival analysis applications. However, leftcensoring and interval-censoring can be treated by similar means. In such a case, we no longer observe Y = y but instead only that Y ∈ [v, ∞) . By monotonicity of h, we have that h −1 (Y; ) exp( ) ∈ [h −1 (v; ) exp( ), ∞) , which can be interpreted as a censored observation of a scaled PH distributed random variable. Moreover, in Albrecher et al. (2021a) (and Bladt and Rojas-Nandayapa (2018)), a modified EM algorithm for the estimation of scaled PH distributions is presented for the case of censored observations. This means that the main change in Algorithm 2 is in step 2, where we must now compute

Estimation for fractional PH distributions
A key distinction of the matrix Mittag-Leffler distribution (or fractional PH), with respect to the other models introduced in Section 3.1, is that the transformation h(x) = x 1∕ and the mixing distribution Θ = 1∕S depend on the same parameter . This makes statistical estimation very challenging by ad-hoc methods, and thus embedding into the SIPH class is useful for this purpose. Note that the transformation parameters are different from the scaling component's parameters for the previously presented models, and this last scenario is the central assumption in the derivation of Algorithm 2. Thus, special treatment must be taken for the estimation of matrix Mittag-Leffler distributions when seen as SIPH distributions. This is now solved by employing a modified EM algorithm, the details given in Algorithm 3.
By the same method of proof of Algorithm 2, one can show that Algorithm 3 increases the likelihood in each iteration, and hence we omit the details for brevity.

Shared scaling
This section presents a multivariate extension of SIPH distributions, inspired by the construction principle of the shared frailty model. The key idea is to think of an underlying random variable which is a common scaling factor to all the coordinates of an independent random vector, creating dependency and heavy-tailedness all at once through the same mechanism.

A class of multivariate CPH distributions
Before going into full generality, we consider the case where there is no deterministic time-transform component. This allows for a more transparent treatment with explicit formulas. Thus, consider the conditionally independent random variables = (Y 1 , … , Y d ) ⊤ given Θ = such that Then, the joint survival function of is given by where ⊕, ⊗ denote the Kronecker sum and product, respectively. In particular, this yields the joint density is the derivative of order d of L Θ (u) , which can again be shown by the use of functional calculus through Cauchy's formula. Moreover, marginally we get continuously scaled PH behavior: Alternatively, it is easy to see that has representation (Y These multivariate distributions were studied from another perspective in Furman et al. (2021), where the authors derived some properties in the context of risk management. We presently derive some probabilistic properties, provide an estimation method, and extend the class to allow for deterministic time transforms. In the next section we also allow for scaling of different components of the random vector by different (but correlated) scaling random variables. Since these distributions will be the building blocks of the more general time-inhomogeneous multivariate models presented in Section 4.3, a good understanding of the former facilitates the treatment of the latter.
Example 4.1 (Gamma scaling). Consider Θ ∼ Gamma ( , 1) , > 0 , then the joint survival function of is given by This distribution can be seen to be a matrix version of Mardia's multivariate Pareto distribution (see Mardia et al. (1962)).

Parameter estimation: multivariate CPH distributions
We now present a generalized EM algorithm for maximum-likelihood estimation of the class of multivariate CPH distributions introduced previously. The complete data is the scaling component Θ together with the conditionally independent Markov jump processes paths. We further assume that Θ belongs to a parametric family depending on the vector and denote by f Θ its corresponding density.
Consider observations n = (y (1) n , … , y (d) n ) ⊤ , n = 1, … , K , from a multivariate CPH distributed random vector, and let ̃ denote the whole data set. We also denote by ̃ and T the sets of parameters { 1 , … , d } and {T 1 , … , T d } , respectively, and (i) k and t (i) kl to refer to the entries of i and T i , i = 1, … , d . In order to write down the complete likelihood L c (̃ ,T, ;̃ ) , we need the following definitions. For each i = 1, … , d , let B i k be the number of times the underlying Markov jump process of Y i starts in state k, N i kl the total number of transitions from state k to l until absorption, N i k the number of times that k was the last state to be visited before absorption, and finally, let Z i k be the cumulated time that the Markov jump process spent in state k. Then, the complete likelihood is given by with corresponding log-likelihood (discarding the terms which do not depend on any parameters) Regarding the E-step, which consists of computing the conditional expectation of the log-likelihood given the observed data, the calculations are somewhat similar to those of Albrecher et al. (2021a). We illustrate the procedure by computing the conditional expectation of the logarithmic term. Consider one (generic) data point ( K = 1 ) and let = 1 . Then The formulas for all the other statistics are derived by similar calculations. Concerning the M-step, consisting of maximizing the conditional expected loglikelihood in terms of the parameters, for the parameter of the scaling component we have in full generality Regarding the PH component's parameters, the entries of the sub-intensity matrix can be found by direct differentiation of the log-likelihood, while for the vector of initial probabilities, we can employ a Lagrange multiplier argument. We omit further details for brevity. We summarize the complete procedure in Algorithm 4.

A class of multivariate SIPH distributions
We now proceed to incorporate deterministic time-inhomogeneity into the shared scaling construction. Consider conditionally independent random variables (Y 1 , … , Y d ) ⊤ given Θ = by Then )) ⊤ , which can be seen as follows This joint distribution can be seen to be a matrix-parameter version of the multivariate Weibull distribution introduced in Manatunga and Oakes (1999).

Remark 4.1
Covariates can be incorporated into the model by assuming that the intensities are of the form Remark 4.2 (Shared frailty model). In the shared frailty model, it is assumed that a group of individuals is conditionally independent given the frailty. In this way, the conditional joint survival function of | Θ = , = (Y 1 , … , Y d ) ⊤ , is given by where M i are baseline cumulative hazards, i = 1, … , d . Thus, the joint survival function of is given by Using that the above joint survival function can be rewritten as In particular, this means that the survival copula of is an Archimedean copula. Note that the shared frailty model is a particular case of the class of multivariate SIPH distributions introduced here when p = 1.
We now study the dependence structure of multivariate SIPH distributions. When p = 1 , the survival copula of is an Archimedean copula. To study the more general case, note that all the transformations presented in Table 1 are strictly increasing. This means that the copulas for models based on these intensities are the same as the ones of the models presented in Section 4.1, and thus it is enough to study the later case. Define the coefficient of upper tail dependence as Proof Given the definition of our model, Proposition 1 of Section 2 in Engelke et al. yields where Z i are PH( i , T i ), and Moreover, Z i ∕ (Z i ) 1∕ is PH distributed with the same vector of initial probabilities i and sub-intensity matrix T i = T i (Z i ) 1∕ , i = 1, 2 . This implies that which now yields Note that the resulting explicit expression for U is in terms of the parameters of the PH components. For instance, when considering Θ ∼ Gamma ( , 1) , the survival copula of the model can be different from the Clayton copula, for which U = 2 − . In Figure 1, we take the same value = 1 and plot the implicit copula of two multivariate CPH distributions, one with upper tail dependence coefficient smaller than 2 −1 and the other larger than 2 −1 , achieved solely by changing the parameters of the PH components.

Parameter estimation: multivariate SIPH distributions
If we assume that i ( ⋅ ; i ) is a parametric function depending on the vector i , i = 1, … , d , and let = ( 1 , … , d ) . Then we can use that (h −1 to formulate a generalized EM algorithm for maximum-likelihood estimation, which generalizes Algorithm 2 to the multivariate case.

Correlated scaling
We now extend the scaling of the sub-intensity matrix of SIPH distributions to the case where we condition on a random vector, the scaling factors being the components of such vector. We consider first the conditionally PH case, i.e. when no deterministic time-transform is present, and a scaling vector = (Θ 1 , … , Θ d ) ⊤ and = (Y 1 , , … , Y d ) ⊤ such that the random variables Y i are conditionally independent given with laws Then, in full generality, the joint survival function of is given by Consider the bivariate case. Then, using functional calculus, we have that that the joint survival function takes the explicit form where L is the joint Laplace transform of , that is

Parameter estimation: correlated CPH distributions
The maximum-likelihood estimation of this class of multivariate distributions can be performed via a generalized EM algorithm. The derivation is done similarly to Algorithm 4 and thus omitted for brevity. Again, for estimation, we assume that belongs to a parametric family depending on the vector and denote by f its corresponding joint density. The resulting detailed routine is provided in Algorithm 6. S ( ) = � 1 exp 1 T 1 y 1 2 exp 2 T 2 y 2 dF ( ) Remark 5.1 This algorithm suffers from the curse of dimensionality. The integrals above must typically be computed numerically, given that explicit expressions are not available. Thus, the number of summands needed for the approximation increases rapidly with the dimension. It is also important to mention that correlated frailty models are typically employed only in the bivariate case. In such a case, the above algorithm is computationally feasible, thus its relevance.

Correlated SIPH distributions
We now introduce an analogous model to the correlated frailty model based on IPH distributions, effectively the most general of our models. Consider a multivariate random scaling component such that Y i are conditionally independent given with conditional distribution The joint survival function of is then given by In the bivariate case, we have by simple calculations (using functional calculus) the explicit expression Note that an alternative representation for is where Z i are independent PH distributed random variables independent of . The proof is akin to those of previous sections. Now we consider a specific example with explicit joint density, namely the correlated Gamma case.
Example 5.1 (Correlated Gamma scaling). Inspired by Yashin et al. (1995), we consider = (Θ 1 , Θ 2 ) ⊤ such that We then fitted a matrix Mittag-Leffler distribution with the same number of phases to the resulting sample using Algorithm 3, obtaining the following parameters: Observe that we can somewhat retrieve the parameters by keeping in mind possible permutation of states (since their labels are not relevant). Figure 2 shows that the algorithm recovers the structure of the data. Moreover, note that ̂= 0.7928 , which determines the heaviness of the tail, is close to the original value = 0.8 . As further evidence of the quality of the fit, we have that the loglikelihood of the fitted model is −1, 769.596 , while using the original distribution parameters and structure, we obtain −1, 773.453.

Matrix-Weibull
Algorithm 2 can be easily modified to approximate given theoretical distributions. As in the PH case (Asmussen et al. (1996)), the idea consists of considering sequences of empirical distributions with increasing sample size. For instance, if we denote by g the theoretical given density that we want to approximate, in step 1, we have that as K → ∞, The rest of the formulas in step 1 are adapted through the same limit. Regarding step 2, we have As a concrete example, we consider a Matrix-Weibull distribution (as introduced in Albrecher and Bladt (2019), having no random scaling component) with density function and parameters Then we fitted a SIPH distribution of 3 phases with baseline intensity (y) = y −1 , > 0 , and positive stable scaling. The fitted parameters are the following The quality of the approximation is supported by Figure 3, which shows that we recover the shape of the original distribution. Moreover, the product ̂̂= 1.9867 , which determines the heaviness of the tail, can be compared with = 2 for the given theoretical model.

Real-life data
The Gamma-Gompertz frailty model is commonly employed for modeling human mortality at old ages (see, e.g., Missov (2013); Vaupel et al. (1979)). In the present example, we propose using SIPH distributions with Gamma scaling and Gompertz baseline intensity for modeling this type of data.
As a concrete case of study, we consider the lifetimes of the Swedish population that die in the year 2011 between ages 50-100. This data was obtained from the Human Mortality Database (HMD). We add covariate information by considering a separation between females ( X = 1 ) and males ( X = 0 ) in the population. Then we fitted a SIPH distribution of 4 phases with general Coxian structure in the PH component. The estimated parameters are Figure 4 shows that the fitted distribution provides a reasonable model for both groups. If an even closer fit is sought, other parameters of the model need to be regressed as well.

Multivariate example
We generated an i.i.d. sample of size 2, 500 from a bivariate CPH distribution with parameterŝ  and Gamma scaling with = 1.5 . Note that the upper tail dependence coefficient of the theoretical model is U = 0.2765 , while the empirical estimator of the sample is ̂U = 0.28 . Then we fitted a bivariate CPH model of same dimensions using Algorithm 4 obtaining the parameters Figure 5 shows that we recover the structure of both marginals. Regarding the dependence structure, this is supported by Figure 6, where we offer some contour plots. Moreover, note that the parameter that determines the heaviness of the tails of the marginals is close to the original model and that the coefficient of upper tail  dependence U = 0.254 is close to the original (and sample) one. Finally, note that the original model's log-likelihood is −11, 753.27 , compared with −11, 752.45 for the fitted model.

Conclusion
We have provided a phase-type-based model which can result in non-exponential tail behavior by introducing random and deterministic transformations. The resulting model is generally tractable in terms of matrix calculus through the Laplace transform of the random component, and thus closed-form formulas allow for statistical and probabilistic treatments, for instance, for fully explicit generalized EM algorithms. In the univariate case, the current three main ways of generating heavytailed phase-type distributions fall into our framework, and several new models are introduced to complement the existing suite of hidden Markov models. In the multivariate case, we obtain generalizations of well-known frailty models with fully explicit densities, contrary to other approaches of multivariate phase-type distributions in the literature (in terms of rewards or copulas). We finally show the feasibility of the statistical implementation of our models using four different examples. Heavy-tailed phase-type distributions are statistically attractive since their interpretation in terms of an underlying evolving process is natural in many domains of application which involve processes that traverse numerous states through time, for instance, human lifetimes or legal cases. With the models and algorithms provided in this paper, we aim to provide a clearer picture of the possibilities and limitations of Markov models for practitioners that require non-standard but interpretable models. A promising further direction of research for generating uni-and multivariate scaled phase-type distributions is to consider a general stochastic process as timechange, which for certain choices may provide fully explicit functionals and estimation procedures while remaining conceptually simple.