1 Introduction

The most commonly used regression model in general insurance pricing is the compound Poisson model with gamma claim sizes. State-of-the-art industry practice fits two separate generalized linear models (GLMs) to the two parts of this model, namely, a Poisson GLM to claim counts and a gamma GLM to claim amounts. Both the Poisson and the gamma distributions belong to the exponential dispersion family (EDF). It has been noted by Tweedie [19] that the compound Poisson model with i.i.d. gamma claim sizes itself belongs to the EDF and, in fact, it closes the interval of power variance functions between the Poisson model and the gamma model, see Section 3 in Jørgensen [5]. As a result of Tweedie’s and Jørgensen’s findings we obtain two different parametrizations of the compound Poisson model with i.i.d. gamma claim sizes. Selection between these two different parametrizations has been explored in the work of Jørgensen–de Souza [6] in the context of GLM insurance pricing. Interestingly, to predict total claim amounts we need to fit two GLMs in the compound Poisson-gamma parametrization, whereas one GLM is sufficient to get the corresponding predictions within Tweedie’s EDF parametrization. This indicates that in GLM applications these two parametrizations are not fully consistent. This point has been raised by Smyth–Jørgensen [17] who propose to use a double generalized linear model (DGLM) in Tweedie’s parametrization to simultaneously model mean and dispersion parameters within the EDF.

The main purpose of this article is to revisit the work of Smyth–Jørgensen [17], and to give conditions under which the two GLMs for claim counts and claim sizes and the DGLM with Tweedie’s EDF parametrization lead to the same predictive model; this involves a discussion of choices of covariate spaces and GLM link functions. Based on this, our first main contribution provides a new result for Tweedie’s DGLM that substantially reduces the computational costs of calibrating the power variance parameter.

The second point that we explore is whether the insurance industry’s preference for the Poisson-gamma parametrization can be justified. A priori it is not clear whether either of the two approaches leads to better predictive models. This part of our work is based on GLMs and on their neural network extensions. We obtain evidence that supports the industry preference; in particular, under the choice of neural network regression models the Poisson-gamma parametrization is simpler to calibrate and leads to more robust results.

We close this introduction with a number of remarks. First, we mention the recent survey paper of Quijano Xacur–Garrido [12], which has similar goals to the present paper. This survey only considers the single GLM case of Tweedie’s parametrization, similar to Jørgensen–de Souza [6]. We emphasize that the full picture can only be obtained by comparing the Poisson-gamma parametrization to the DGLM case introduced in Smyth–Jørgensen [17]. Therefore, we revisit and extend this latter reference to obtain a comprehensive comparison. Our view is supported by examples. These examples provide a proof of concept for situations with claims that are not too heavy tailed. However, these examples also highlight the weaknesses of this model on real insurance data, which often exhibits heavier tails than is suitable under a gamma assumption. We remark that in our discussion we use the terminology of general insurance pricing; however, as is commonly the case in general insurance, all our findings translate one-to-one to claims reserving problems.

Organization of the paper. In Sect. 2 we introduce the compound Poisson model with i.i.d. gamma claim sizes and we derive its corresponding Tweedie parametrization. In Sect. 3 we embed both approaches into a GLM framework. We present the two GLMs needed for the Poisson-gamma parametrization, and we discuss a single GLM and a DGLM parametrization for Tweedie’s approach. Our main results, Theorems 3.6 and 3.8, give conditions under which the different GLM parametrizations lead to identical predictive models. These theorems provide a remarkable property that allows us to lower calibration costs in Tweedie’s DGLMs. In Sect. 4 we give insights and intuition based on numerical examples both under GLMs and neural network regression models. In Sect. 5 we conclude, and the “Appendix” gives a short summary of GLMs and describes the data used.

2 Tweedie’s compound Poisson model

In Sect. 2.1 we introduce the compound Poisson model with i.i.d. gamma claim sizes, and in Sect. 2.2 we revisit its Tweedie counterpart. For simplicity, in these two sections, we think of these models as describing one single insurance policy only. In Sect. 3, below, we consider multiple insurance policies, also allowing for heterogeneity between policies.

2.1 Compound Poisson model with i.i.d. gamma claim sizes

Let N be the number of claims and let \((Z_j)_{j\ge 1}\) be the corresponding claim sizes. We assume that the number of claims, N, is Poisson distributed with mean \(\lambda w\), where \(\lambda >0\) is the expected claim frequency relative to a given exposure \(w>0\); we write \(N\sim \mathrm{Poi}(\lambda w)\). We assume that the claim sizes \(Z_j\), \(j\ge 1\), are i.i.d. and independent of N having a gamma distribution with shape parameter \(\gamma >0\) and scale parameter \(c>0\); we write \(Z_1 \sim {{\mathcal {G}}}(\gamma ,c)\) for this gamma distribution. The moment generating function of the gamma claim sizes is given by, see Section 3.2.1 in [20],

$$\begin{aligned} {{\mathbb {E}}}\left[ \exp \{r Z_1\}\right] = \left( \frac{c}{c-r}\right) ^\gamma , \qquad \text { for\, r<c.} \end{aligned}$$

The compound Poisson model with i.i.d. gamma claim sizes (CPG) is then defined by \(S=\sum _{j=1}^N Z_j\); we use notation \(S\sim \mathrm{CPG}(\lambda w, \gamma , c)\). The moment generating function of S is given by

$$\begin{aligned} {{\mathbb {E}}}\left[ \exp \left\{ r S \right\} \right] = \exp \left\{ \lambda w \left( \left( \frac{c}{c-r}\right) ^\gamma -1 \right) \right\} , \quad \text { for\, r<c,} \end{aligned}$$
(2.1)

we refer to Proposition 2.11 in [20].
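
For illustration, the CPG model is straightforward to simulate; the following minimal R sketch (parameter values are illustrative and of our own choosing) checks the first two moments \({{\mathbb {E}}}[S]=\lambda w \gamma /c\) and \(\mathrm{Var}(S)=\lambda w \gamma (\gamma +1)/c^2\) that follow from (2.1).

```r
# Minimal R sketch: simulate S ~ CPG(lambda * w, gamma, c); note that c is the
# rate parameter of the gamma distribution. All parameter values are illustrative.
set.seed(1)
lambda <- 0.1; w <- 2; gam <- 1.5; c0 <- 0.01
N <- rpois(1e5, lambda * w)                                        # claim counts
S <- sapply(N, function(n) sum(rgamma(n, shape = gam, rate = c0))) # total claim amounts
c(mean(S), lambda * w * gam / c0)                # empirical vs. theoretical mean
c(var(S),  lambda * w * gam * (gam + 1) / c0^2)  # empirical vs. theoretical variance
```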

2.2 Tweedie’s compound Poisson model

Following [5, 6, 17, 19] we select a particular model within the EDF. A random variable Y belongs to the EDF if its density has the following form (w.r.t. a \(\sigma\)-finite measure on \({{\mathbb {R}}}\))

$$\begin{aligned} Y~\sim ~ f(y; \theta , w/\phi )= \exp \left\{ \frac{y\theta - \kappa (\theta )}{\phi /w} + a(y;w/\phi )\right\} , \end{aligned}$$
(2.2)

where \(w>0\) is a given exposure (weight, volume), \(\phi >0\) is the dispersion parameter, \(\theta \in \varvec{\Theta }\) is the canonical parameter in the effective domain \(\varvec{\Theta }\), \(\kappa :\varvec{\Theta }\rightarrow {{\mathbb {R}}}\) is the cumulant function, and \(a(\cdot ; \cdot )\) is the normalization, not depending on the canonical parameter \(\theta.\)

For properties of the EDF we refer to “Appendix A”, below. Tweedie’s compound Poisson (CP) model is obtained by choosing for \(p \in (1,2)\) the cumulant function

$$\begin{aligned} \kappa (\theta ) =\kappa _p(\theta ) = \frac{1}{2-p} \left( (1-p){\theta }\right) ^{\frac{2-p}{1-p}}, \quad \text { on effective domain } \theta \in \varvec{\Theta }={{\mathbb {R}}}_-=(-\infty ,0). \end{aligned}$$
(2.3)

We use notation \(Y\sim \mathrm{Tweedie}(\theta , w, \phi , p)\). The first two derivatives of the cumulant function provide the first two moments of Y, see also (A.1) in the “Appendix”,

$$\begin{aligned} \mu= & {{\mathbb {E}}}\left[ Y\right] = \kappa _p'(\theta ) = \left( (1-p){\theta }\right) ^{\frac{1}{1-p}},\end{aligned}$$
(2.4)
$$\begin{aligned} \mathrm{Var}\left( Y\right)= & \frac{\phi }{w}\kappa _p''(\theta ) = \frac{\phi }{w} \left( (1-p){\theta }\right) ^{\frac{p}{1-p}} = \frac{\phi }{w}\mu ^p. \end{aligned}$$
(2.5)

Hyper-parameter \(p\in (1,2)\) allows us to model the power variance functions \(V(\mu )=\mu ^p\) between the Poisson boundary case \(p=1\) and the gamma boundary case \(p=2\); we refer to Sect. 3.1, below, for the boundary cases. The map \(\mu \mapsto \theta =(\kappa _p')^{-1}(\mu )\) gives the canonical link of Tweedie’s CP model.

We calculate the moment generating function of the exposure scaled Tweedie’s CP random variable wY, see also Corollary 7.21 in [20],

$$\begin{aligned} {{\mathbb {E}}}\left[ \exp \{r wY\}\right]= & \exp \left\{ \frac{w}{\phi }\left( \kappa _p(\theta +r\phi ) -\kappa _p(\theta )\right) \right\} \\= & \exp \left\{ \frac{w}{\phi }~\kappa _p(\theta ) \left( \left( \frac{-\theta /\phi }{-\theta /\phi -r} \right) ^{\frac{2-p}{p-1}} - 1\right) \right\} , \quad \text { for } r<-\theta /\phi . \end{aligned}$$

Note that this is a CPG model in a different parametrization; we call the model under this EDF parametrization Tweedie’s CP model. The following proposition follows by comparing the corresponding moment generating functions.

Proposition 2.1

Choose \(S \sim \mathrm{CPG}(\lambda w, \gamma , c)\) and \(Y\sim \mathrm{Tweedie}(\theta , w, \phi , p)\). We have identity in distribution \(S/w{\mathop =\limits ^\mathrm{(d)}}Y\) under parameter identification

$$\begin{aligned} \gamma =\frac{2-p}{p-1}\Leftrightarrow & p=\frac{\gamma +2}{\gamma +1} ~\in ~(1,2), \end{aligned}$$
(2.6)
$$\begin{aligned} c= & {-\theta }/{ \phi }, \end{aligned}$$
(2.7)
$$\begin{aligned} \lambda= & \frac{1}{\phi }~\kappa _p(\theta ) . \end{aligned}$$
(2.8)

Formula (2.8) can be rewritten in different ways. We have, using the canonical link of Tweedie’s CP model, \(\theta =(\kappa _p')^{-1}(\mu ) = \mu ^{1-p}/(1-p)\) and \(\kappa _p(\theta )=\kappa _p ((\kappa _p')^{-1}(\mu ))=\mu ^{2-p}/{(2-p)}\). This implies, using (2.7) in the second step and (2.6) in the last step,

$$\begin{aligned} \lambda = \frac{1}{\phi }~\kappa _p(\theta )~=~ \frac{c}{-\theta }~\kappa _p(\theta )~=~ c~\frac{p-1}{\mu ^{1-p}}~\frac{\mu ^{2-p}}{2-p} ~=~ \frac{c}{\gamma }~\mu . \end{aligned}$$
(2.9)

The latter says that, of course, the expected claim frequency \(\lambda\) is obtained by dividing the expected total claim amount \({{\mathbb {E}}}[Y]=\mu\) by the average claim size \({{\mathbb {E}}}[Z_1]=\gamma /c\).

Thus, under parameter identification scheme (2.6)–(2.8) the two models are identical:

$$\begin{aligned}&\mathrm{Tweedie}\left( \theta , w, \phi , p\right) {\mathop =\limits ^\mathrm{(d)}} \mathrm{CPG}\left( \frac{w}{\phi }\kappa _p(\theta ), \frac{2-p}{p-1}, \frac{-\theta w}{\phi }\right) ,\qquad \text { or}\\&\mathrm{CPG}\left( \lambda w, \gamma , c\right) {\mathop =\limits ^\mathrm{(d)}} \mathrm{Tweedie}\left( (\kappa _p')^{-1}\left( \lambda w \frac{ \gamma }{c}\right) , w, \frac{-w}{c}(\kappa _p')^{-1}\left( \lambda w\frac{ \gamma }{c}\right) , \frac{\gamma +2}{\gamma +1}\right) . \end{aligned}$$

This illustrates that there is a one-to-one correspondence between the CPG parametrization and Tweedie’s CP parametrization, i.e. the two models are identical and only differ in interpretation of parameters. The next section will demonstrate that these subtle differences can be crucial for GLM regression modeling, and resulting models can be rather different as functions of explanatory covariates, see Sect. 3.3 below.
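
The identification scheme (2.6)–(2.8) of Proposition 2.1 is easily implemented; the following minimal R sketch (function names are ours) maps one parametrization into the other and allows us to verify the one-to-one correspondence numerically.

```r
# Minimal R sketch of the re-parametrization (2.6)-(2.8); function names are ours.
cpg_to_tweedie <- function(lambda, gam, c0) {
  p     <- (gam + 2) / (gam + 1)       # (2.6)
  mu    <- lambda * gam / c0           # (2.9): mean mu = E[Y]
  theta <- mu^(1 - p) / (1 - p)        # inverse of the canonical link (2.4)
  phi   <- -theta / c0                 # (2.7)
  c(theta = theta, phi = phi, p = p)
}
tweedie_to_cpg <- function(theta, phi, p) {
  kappa <- ((1 - p) * theta)^((2 - p) / (1 - p)) / (2 - p)  # cumulant function (2.3)
  c(lambda = kappa / phi,              # (2.8)
    gam    = (2 - p) / (p - 1),        # (2.6)
    c0     = -theta / phi)             # (2.7)
}
# round trip recovers (lambda, gamma, c) = (0.1, 1.5, 0.01)
par_tw <- cpg_to_tweedie(lambda = 0.1, gam = 1.5, c0 = 0.01)
tweedie_to_cpg(par_tw[["theta"]], par_tw[["phi"]], par_tw[["p"]])
```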

3 Generalized linear models and parameter estimation

In this section we study multiple insurance policies \(i=1,\ldots , n\) having claim distributions \(\mathrm{CPG}(\lambda _i w_i, \gamma , c_i)\) and \(\mathrm{Tweedie}(\theta _i, w_i, \phi _i, p)\), respectively. We allow for heterogeneity between the policies in all parameters that have a lower index i. We describe modeling and parameter estimation within GLMs: we consider two GLMs to model \(\lambda _i\) (Poisson claim counts) and \(\zeta _i=\gamma /c_i\) (gamma claim sizes) in the former case, and we consider a DGLM to model \(\theta _i\) and \(\phi _i\) in the latter case. There is a slight difference between “two GLMs” and a “double GLM”: the former considers two independent GLMs, whereas the latter considers two GLMs simultaneously. The volumes \(w_i\) are assumed to be known and do not need any modeling. The shape parameter \(\gamma >0\) and the power variance parameter \(p=(\gamma +2)/(\gamma +1)\), see (2.6), are assumed to be the same for all policies i; this is a standard assumption in state-of-the-art use of these GLMs. An overview of GLMs and their parameter estimation within the EDF is given in “Appendix A”.

3.1 Compound Poisson model with i.i.d. gamma claim sizes

We begin with the CPG model. Since the log-likelihood function of the CPG model decouples into two separate parts for claim counts and claim sizes, maximum likelihood estimation (MLE) of claim counts and claim size models can be done independently from each other. We start from n independent random variables \(S_i\sim \mathrm{CPG}(\lambda _iw_i,\gamma ,c_i)\) with

$$\begin{aligned} S_i= \sum _{j=1}^{N_i} Z_{i,j}, \quad \text { for insurance policies } i=1,\ldots , n. \end{aligned}$$

The joint log-likelihood function of this model, given observations \((N_i)_i\) and \((Z_{i,j})_{i,j}\) and weights \((w_i)_i\), is given by

$$\begin{aligned}&\ell ((\lambda _i)_{i=1,\ldots , n}, \gamma , (c_i)_{i=1,\ldots , n})\nonumber \\&\quad = \sum _{i=1}^n \bigg (-\lambda _i w_i + N_i\log (\lambda _i w_i)-\log (N_i!)\nonumber \\&\quad +~\sum _{j=1}^{N_i} \gamma \log (c_i)-\log \Gamma (\gamma )+(\gamma -1)\log (Z_{i,j}) -c_iZ_{i,j}\bigg ), \end{aligned}$$
(3.1)

where the term on the second line is zero for \(N_i=0\). Note that in this log-likelihood function (for parameter estimation) we treat \((N_i)_i\) and \((Z_{i,j})_{i,j}\) as known observations; for notational convenience we do not use lower case letters for observations. From (3.1) we now see that we can estimate the Poisson parameters \(\lambda _i\) and the gamma parameters \(\gamma\) and \(c_i\) independently from each other; the former uses observations \((N_i)_i\) and the latter observations \((N_i)_i\) and \((Z_{i,j})_{i,j}\).

Furthermore, we assume that each insurance policy \(i=1,\ldots , n\) is established with covariate information \({\varvec{x}}_i =(x_{i,0},\ldots , x_{i,d})'\in {{\mathcal {X}}} \subset \{1\}\times {{\mathbb {R}}}^{d}\), having initial component \(x_{i,0}= 1\) for modeling the intercept component.

GLM for claim counts: Assume that the expected frequencies \(\lambda _i=\lambda ({\varvec{x}}_i)\) of policies \(i=1,\ldots , n\) can be modeled by a log-linear regression function

$$\begin{aligned} \lambda : {{\mathcal {X}}} \rightarrow {{\mathbb {R}}}_+,\qquad {\varvec{x}}\mapsto \lambda ({\varvec{x}}) = \exp \left\langle \varvec{\beta },{\varvec{x}}\right\rangle = \exp \left\{ \beta _0 + \sum _{k=1}^d\beta _k x_k \right\} , \end{aligned}$$
(3.2)

with regression parameter \(\varvec{\beta }=(\beta _0,\ldots , \beta _d)'\in {{\mathbb {R}}}^{d+1}\). Assuming that the design matrix \({\mathfrak {X}}=({\varvec{x}}_1,\ldots , {\varvec{x}}_n)'\in {{\mathbb {R}}}^{n\times (d+1)}\) has full rank \(d+1\) we find the unique MLE \(\widehat{\varvec{\beta }}\) for \(\varvec{\beta }\) by the (unique) solution of

$$\begin{aligned} {\mathfrak {X}}'~\mathrm{diag}(w_1,\ldots , w_n) \left( \left( \frac{N_1}{w_1},\ldots , \frac{N_n}{w_n}\right) '-\exp \{{\mathfrak {X}}\varvec{\beta }\} \right) ={\varvec{0}}. \end{aligned}$$
(3.3)

Note that the Poisson distribution has an EDF representation with cumulant function \(\kappa (\cdot )=\kappa _1(\cdot )=\exp \{\cdot \}\). The lower index \(p=1\) in the cumulant function \(\kappa _1(\cdot )\) indicates that we have variance function \(V(\lambda )=\lambda\) in the Poisson case, see also (2.5). The choice (3.2) corresponds to the canonical link \((\kappa _1')^{-1}(\cdot )=\log (\cdot )\) in the Poisson GLM. The choice of the canonical link implies that we obtain an unbiased portfolio estimate, see [21]. The score Eq. (3.3) is solved numerically, for details see (A.3) in “Appendix A”.
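
In R, the score Eq. (3.3) is solved by the standard iteratively re-weighted least squares implementation of glm; a minimal sketch, assuming a data frame dat with columns N (claim counts), w (exposures) and covariates x1, x2 (names are ours):

```r
# Minimal R sketch of the Poisson claim count GLM (3.2)-(3.3); the data frame
# 'dat' with columns N (claim counts), w (exposures), x1, x2 is assumed.
fit_pois <- glm(N ~ x1 + x2, family = poisson(link = "log"),
                offset = log(w), data = dat)   # exposure enters as a log-offset
lambda_hat <- fitted(fit_pois) / dat$w         # estimated expected frequencies
```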

GLM for gamma claim sizes: Consider only insurance policies i which have claims, i.e. with \(N_i>0\). All subsequent considerations in this paragraph are conditional on \(N_i\). The average claim amount on policy i has a conditional gamma distribution

$$\begin{aligned} {\bar{Z}}_i= \frac{1}{N_i}\sum _{j=1}^{N_i} Z_{i,j}\bigg |_{\{N_i\}} \sim {{\mathcal {G}}}(\gamma N_i, c_i N_i), \end{aligned}$$
(3.4)

with shape parameter \(\gamma N_i\) and scale parameter \(c_i N_i\) (note that \(\gamma\) is not policy i dependent). This gamma distributed random variable has conditional mean and variance given by

$$\begin{aligned} \zeta _i={{\mathbb {E}}}[ {\bar{Z}}_i | N_i] = \frac{\gamma }{c_i} \quad \text { and } \quad \mathrm{Var}( {\bar{Z}}_i | N_i) = \frac{\gamma }{c^2_i N_i} = \frac{1}{\gamma N_i} \left( \frac{\gamma }{c_i}\right) ^2 = \frac{1}{\gamma N_i} \zeta _i^2. \end{aligned}$$

This model belongs to the EDF (2.2) with cumulant function \(\kappa _2(\theta )=-\log (-\theta )\) for \(\theta \in \varvec{\Theta } ={{\mathbb {R}}}_-\), dispersion parameter \(\phi =1/\gamma\) and exposure \(w_i=N_i\). The conditional mean and variance are

$$\begin{aligned} \zeta _i= {{\mathbb {E}}}[ {\bar{Z}}_i | N_i] = \kappa _2'(\theta _i) = -\frac{1}{\theta _i}\qquad \text {and} \qquad \mathrm{Var}( {\bar{Z}}_i | N_i) = \frac{1}{\gamma N_i}\kappa _2''(\theta _i) = \frac{1}{\gamma N_i}\left( -\frac{1}{\theta _i}\right) ^2. \end{aligned}$$

This is the boundary case \(p=2\) in Tweedie’s CP model with power variance function \(V(\zeta )=\zeta ^2\), see (2.5).

We set up a second GLM for gamma claim size modeling. This second GLM does not necessarily need to rely on the same covariate space \({{\mathcal {X}}}\) as the Poisson GLM (3.2) for claim counts modeling. To emphasize this point, we introduce a new covariate space containing covariate information \({\varvec{z}}_i =(z_{i,0},\ldots , z_{i,q})'\in {{\mathcal {Z}}} \subset \{1\}\times {{\mathbb {R}}}^{q}\) having initial component \(z_{i,0}= 1\) modeling the intercept. We interpret the choices \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) as follows: both covariates \({\varvec{x}}_i \in {{\mathcal {X}}}\) and \({\varvec{z}}_i \in {{\mathcal {Z}}}\) should belong to the same insurance policy i, however, inclusion of individual covariate components and pre-processing of these components may differ between the two regression models. This reflects the aim of optimizing the predictive performance of each of the two regression models.

We make the following regression assumption: choose a suitable link function \(g_2(\cdot )\) to receive the linear predictor, see also “Appendix A”,

$$\begin{aligned} g_2(\zeta _i)= g_2({{\mathbb {E}}}[ {\bar{Z}}_i| N_i]) =g_2(\kappa _2'(\theta _i))=g_2(-1/\theta _i)= \eta _i = \left\langle \varvec{\alpha },{\varvec{z}}_i \right\rangle , \end{aligned}$$
(3.5)

for regression parameter \(\varvec{\alpha } \in {{\mathbb {R}}}^{q+1}\). Formula (3.5) explains the relationship between the mean \(\zeta ={{\mathbb {E}}}[ {\bar{Z}}| N]=\kappa _2'(\theta )\), the canonical parameter \(\theta\) and the linear predictor \(\eta =\eta ({\varvec{z}})\). Usually, one does not select the canonical link in the gamma GLM because the negativity constraint on the canonical parameter \(\theta \in \varvec{\Theta } ={{\mathbb {R}}}_-\) may be too restrictive for choosing a linear functional regression form; this is in contrast to the Poisson GLM (3.2). Therefore, the choice of the link function \(g_2(\cdot )\) has to be made carefully, because we require \(1/\theta _i = -g_2^{-1}(\eta _i)=-g_2^{-1}\left\langle \varvec{\alpha }, {\varvec{z}}_i \right\rangle <0\) for all policies \(i=1,\ldots , n\), otherwise the canonical parameter \(\theta _i\) is not in the effective domain \(\varvec{\Theta }\). Below, we will choose the log-link for \(g_2\), which is a common choice for gamma GLMs.

For the log-likelihood function, considering only policies \(i=1,\ldots , m\) with \(N_i>0\), these choices imply

$$\begin{aligned} \ell (\varvec{\alpha }) = \sum _{i=1}^m \gamma N_i \left( {\bar{Z}}_i\theta _i-\kappa _2\left( \theta _i\right) \right) +a({\bar{Z}}_i; \gamma N_i). \end{aligned}$$
(3.6)

The MLE \(\widehat{\varvec{\alpha }}\) of \(\varvec{\alpha }\) is found by solving the score equation, see “Appendix A”,

$$\begin{aligned} { \nabla _{\varvec{\alpha }}\ell (\varvec{\alpha }) ={\varvec{0}} ~\Leftrightarrow ~ {\mathfrak {Z}}'W_2 {\varvec{R}}={\varvec{0}},} \end{aligned}$$
(3.7)

with design matrix \({\mathfrak {Z}}=({\varvec{z}}_1,\ldots , {\varvec{z}}_m)'\in {{\mathbb {R}}}^{m\times (q+1)}\), diagonal working weight matrix (using \(V(\zeta _i)= \zeta _i^{2}\))

$$\begin{aligned} W_2 =\gamma ~ \mathrm{diag} \left( \left( \frac{\partial g_2(\zeta _i)}{\partial \zeta _i}\right) ^{-2} N_i \zeta _i^{-2} \right) _{i=1,\ldots , m}, \end{aligned}$$
(3.8)

and with working residual vector \({\varvec{R}}= (\frac{\partial g_2(\zeta _i)}{\partial \zeta _i}({\bar{Z}}_i -\zeta _i))_{i=1,\ldots , m}\).
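
In R, this gamma GLM is obtained analogously; a minimal sketch under the log-link choice for \(g_2\), assuming a data frame dat_claims of the m policies with claims (column names are ours). Note that the shape parameter \(\gamma\) does not enter the call, in line with the first remark below.

```r
# Minimal R sketch of the gamma claim size GLM (3.5) with log-link; 'dat_claims'
# with columns Zbar (claim averages), N (claim counts), z1, z2 is assumed.
fit_gamma <- glm(Zbar ~ z1 + z2, family = Gamma(link = "log"),
                 weights = N, data = dat_claims)  # EDF weights w_i = N_i, see (3.4)
zeta_hat <- fitted(fit_gamma)                     # estimated mean claim sizes gamma / c_i
```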

Remarks 3.1

  • Shape parameter \(\gamma\) may be treated as a hyper-parameter, and the explicit choice of \(\gamma\) does not influence parameter estimation because it cancels in the score Eq. (3.7).

  • MLE (3.6)–(3.7) is expressed in the sufficient statistics \({\bar{Z}}_i\), and we obtain the same regression parameter estimate \(\widehat{\varvec{\alpha }}\) if we perform MLE directly on the individual claim sizes \(Z_{i,j}\); see the sketch following these remarks. This is an important property, namely, the gamma GLM can be fit solely on the number of claims \(N_i\) and the claim average \({\bar{Z}}_i\) (equivalently, the total claim amount) of each policy i. Moreover, this estimated model still allows us to simulate individual claim sizes \(Z_{i,j}\). Thus, GLM regression parameter estimation does not differ whether we consider claim averages \({\bar{Z}}_i\) or individual claim sizes \(Z_{i,j}\). On the other hand, the process of model and variable selection might give different results in the two estimation cases (\({\bar{Z}}_i\) vs. \(Z_{i,j}\)) because the log-likelihood functions and the estimates for \(\gamma\) differ; this is, e.g., important for model selection using likelihood ratio tests or Akaike’s information criterion, see Remarks 3.10, below.

  • If we model claim counts and claim sizes separately, we use maximal available information \(N_i\) and \(Z_{i,j}\). Moreover, we can design covariate spaces \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) in an optimal way, and independently from each other.

  • If (3.5) is not based on the canonical link of the gamma model, the balance property will not be fulfilled, see [22]. This should be corrected by shifting the intercept parameter \(\alpha _0\) correspondingly. Often one chooses the log-link for \(g_2(\cdot )\); under the log-link choice we can also reformulate the regression problem by replacing the average claim amount response (3.4) by the (conditional) total claim amount \(S_i|_{\{N_i\}}\) and treating \(\log (N_i)\) as a known offset in the linear predictor.

  • Shape parameter \(\gamma <1\) leads to an over-dispersed model with strictly decreasing density, and for \(\gamma >1\) the density is uni-modal. Above \(\gamma\) is treated as a hyper-parameter, and below we discuss MLE of \(\gamma\).

  • If shape parameter \(\gamma _i\) needs explicit modeling as a function of i, then (3.7)–(3.8) will no longer have such a simple structure, and MLE of \(\varvec{\alpha }\) will depend on the explicit choices of \(\gamma _i\). In this case, one can either use a gamma DGLM or one can rely on the 2-dimensional exponential family. The latter model is less tractable numerically. It considers cumulant function \(\kappa (\theta _1,\theta _2) = \log \Gamma (\theta _2)-\theta _2\log (-\theta _1)\) for scale parameter \(c=-\theta _1>0\) and shape parameter \(\gamma =\theta _2>0\). This gives inverse link function, see [21],

    $$\begin{aligned} \nabla _{(\theta _1,\theta _2)} \kappa (\theta _1,\theta _2)=\left( \frac{\theta _2}{-\theta _1}, \frac{\Gamma '(\theta _2)}{\Gamma (\theta _2)}-\log (-\theta _1)\right) ', \end{aligned}$$

    the first component being the mean of the gamma distributed random variable Z, and the second component being the mean of \(\log (Z)\). We do not pursue this approach further because we would lose the connection to Tweedie’s CP approach with a policy independent power variance parameter, see the next section.
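
The sufficient statistic property of the second remark above can be checked directly in R; a minimal sketch, where dat_ind is assumed to hold one row per individual claim \(Z_{i,j}\) (with the covariates of the corresponding policy) and dat_claims is as before:

```r
# Minimal R sketch: the gamma GLM fitted on individual claim sizes Z_ij (unit
# weights) and on claim averages Zbar_i (weights N_i) yields the same alpha-hat.
fit_ind <- glm(Z ~ z1 + z2, family = Gamma(link = "log"), data = dat_ind)
fit_avg <- glm(Zbar ~ z1 + z2, family = Gamma(link = "log"),
               weights = N, data = dat_claims)
all.equal(coef(fit_ind), coef(fit_avg))   # TRUE up to numerical tolerance
```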

It remains to estimate the shape parameter \(\gamma\) for given MLE \(\widehat{\varvec{\alpha }}\). One could either use Pearson’s dispersion estimate for \(1/\gamma\) or directly calculate the MLE of \(\gamma\). In view of (3.6), the MLE is obtained from the score equation \(\frac{\partial }{\partial \gamma } \ell (\widehat{\varvec{\alpha }}, \gamma ) =0\), which yields

$$\begin{aligned} \sum _{i=1}^m N_i \left( {\bar{Z}}_i{\widehat{\theta }}_i-\kappa _2({\widehat{\theta }}_i)\right) +N_i \log (\gamma N_i) +N_i+ N_i\log ({\bar{Z}}_i)-N_i\frac{\Gamma '(\gamma N_i)}{\Gamma (\gamma N_i)}=0, \end{aligned}$$
(3.9)

where we set \({\widehat{\theta }}_i= -\exp \{-\langle \widehat{\varvec{\alpha }},{\varvec{z}}_i\rangle \}\). Either we solve this score equation numerically using the Newton-Raphson algorithm, or we plot the one-dimensional log-likelihood function \(\gamma \mapsto \ell (\widehat{\varvec{\alpha }},\gamma )\) and determine the MLE \({\widehat{\gamma }}\) from this plot, see Fig. 1, below, for an example.
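
A minimal R sketch of this root search for (3.9), using the base R functions digamma and uniroot (the search interval is chosen generously and may need adjustment):

```r
# Minimal R sketch: solve the score equation (3.9) for the MLE of gamma;
# Zbar and N refer to the m policies with claims, as in the gamma GLM sketch above.
theta_hat <- -1 / fitted(fit_gamma)              # canonical parameters, log-link case
score_gamma <- function(gam) {
  sum(N * (Zbar * theta_hat + log(-theta_hat))   # N_i (Zbar_i theta_i - kappa_2(theta_i))
      + N * log(gam * N) + N + N * log(Zbar)
      - N * digamma(gam * N))                    # digamma(x) = Gamma'(x) / Gamma(x)
}
gamma_hat <- uniroot(score_gamma, interval = c(1e-4, 1e4))$root
```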

We conclude by calculating Fisher’s information matrix for \((\varvec{\alpha },\gamma )\) in our gamma GLM. We have, see “Appendix A”,

$$\begin{aligned} -{{\mathbb {E}}}\left[ \left. \nabla ^2_{\varvec{\alpha }}\ell (\varvec{\alpha },\gamma )\right| N_1,\ldots , N_m\right] = {\mathfrak {Z}}'W_2 {\mathfrak {Z}}. \end{aligned}$$

For the second derivative of the \(\gamma\) term we have

$$\begin{aligned} -{{\mathbb {E}}}\left[ \left. \frac{\partial ^2}{\partial \gamma ^2} \ell (\varvec{\alpha }, \gamma )\right| N_1,\ldots , N_m\right] = -\sum _{i=1}^m \frac{N_i}{\gamma }-N_i^2 \psi '(x) \bigg |_{x=\gamma N_i}, \end{aligned}$$

where the second order derivative \(\psi '(x)=\frac{d^2}{dx^2}\log \Gamma (x)\) of the log-gamma function is known as the trigamma function, see [10, Sec. 5.15]. The trigamma function is directly available in the statistical software R [13]. For the off-diagonal terms we have

$$\begin{aligned} -{{\mathbb {E}}}\left[ \left. \nabla _{\varvec{\alpha }} \frac{\partial }{\partial \gamma } \ell (\varvec{\alpha },\gamma ) \right| N_1,\ldots , N_m\right] = -\sum _{i=1}^m N_i{{\mathbb {E}}}\left[ \left. {\bar{Z}}_i-\kappa '_2({\theta }_i)\right| N_1,\ldots , N_m\right] \nabla _{\varvec{\alpha }} {\theta }_i = {\varvec{0}}. \end{aligned}$$

This gives us the following Fisher’s information matrix for the gamma claim size modeling

$$\begin{aligned} {{\mathcal {I}}}(\varvec{\alpha }, \gamma ) = \left( \begin{array}{cc} {\mathfrak {Z}}'W_2{\mathfrak {Z}} & {\varvec{0}}\\ {\varvec{0}}'& -\sum _{i=1}^m {N_i}/{\gamma }-N_i^2 \left. \psi '(x) \right| _{x=\gamma N_i} \end{array} \right) . \end{aligned}$$

3.2 Tweedie’s compound Poisson generalized linear model

3.2.1 Homogeneous dispersion case

From Sect. 2.2 we know that Tweedie’s CP model belongs to the EDF, thus, GLM modeling is straightforward. In this subsection we start with the case of a homogeneous dispersion parameter \(\phi >0\); arguments against this case are given in Remarks 3.2, below. We assume having n independent random variables \(Y_i \sim \mathrm{Tweedie}(\theta _i, w_i, \phi , p)\), and we choose hyper-parameter \(p=(\gamma +2)/(\gamma +1) \in (1,2)\) to make Tweedie’s CP model consistent with the CPG case, see Proposition 2.1. Choosing a suitable link function \(g_p(\cdot )\) we make the following regression assumption for the linear predictor

$$\begin{aligned} g_p(\mu _i)= g_p({{\mathbb {E}}}[ Y_i]) =g_p(\kappa _p'(\theta _i))= \eta _i = \left\langle \varvec{\beta }^*,{\varvec{x}}^*_i \right\rangle , \end{aligned}$$
(3.10)

where \({\varvec{x}}^*_i \in {{\mathcal {X}}}^*\subset \{1\}\times {{\mathbb {R}}}^{d^*}\) are the covariates of policy i and \(\varvec{\beta }^*\) is the regression parameter. We change the covariate notation compared to Sect. 3.1 because covariate pre-processing might be done differently for Tweedie’s CP model compared to the CPG case (because we consider different responses). In complete analogy with the above, MLE requires solving the score equations

$$\begin{aligned} { \nabla _{\varvec{\beta }^*}\ell (\varvec{\beta }^*)={\varvec{0}} ~\Leftrightarrow ~ {\mathfrak {X}}'W_p {\varvec{R}}={\varvec{0}},} \end{aligned}$$
(3.11)

with design matrix \({\mathfrak {X}}=({\varvec{x}}^*_1,\ldots , {\varvec{x}}^*_n)'\), diagonal working weight matrix (using \(V(\mu _i)=\mu _i^{p}\))

$$\begin{aligned} W_p =\frac{1}{\phi }~\mathrm{diag} \left( \left( \frac{\partial g_p(\mu _i)}{\partial \mu _i}\right) ^{-2} w_i \mu _i^{-p} \right) _{i=1,\ldots , n}, \end{aligned}$$
(3.12)

and working residual vector \({\varvec{R}}= (\frac{\partial g_p(\mu _i)}{\partial \mu _i}(Y_i -\mu _i))_{i=1,\ldots , n}\).
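
For a fixed power variance parameter p, this homogeneous dispersion Tweedie GLM can be fitted in R with the tweedie family of the package statmod; a minimal sketch (the data frame dat with total claim costs Y = S/w is assumed, column names are ours):

```r
# Minimal R sketch of the homogeneous dispersion Tweedie GLM (3.10)-(3.12);
# 'dat' with columns Y = S/w (claim costs), w (exposures), x1, x2 is assumed.
library(statmod)
p <- 1.5                                                        # power variance parameter
fit_tw <- glm(Y ~ x1 + x2,
              family = tweedie(var.power = p, link.power = 0),  # link.power = 0: log-link
              weights = w, data = dat)
mu_hat <- fitted(fit_tw)
```

The profile log-likelihood in p can then be evaluated over a grid of p values; the R package tweedie provides corresponding functionality.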

Remarks 3.2

There are a couple of crucial differences between Tweedie’s CP approach with homogeneous dispersion \(\phi\) and the CPG approach of the previous section:

  1. The CPG approach of the previous section uses all available information of claim counts \(N_i\) and claim averages \({\bar{Z}}_{i}\), whereas Tweedie’s CP approach with homogeneous dispersion parameter only uses total claim cost information \(Y_i\).

  2. The former approach allows us to consider different covariate spaces \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) for claim counts and claim size modeling, whereas the latter approach only relies on one version of the covariate space \({{\mathcal {X}}}^*\).

  3. The mean estimates \({\widehat{\lambda }}_i{\widehat{\zeta }}_i\) in the CPG case do not rely on the particular choice of the shape parameter \(\gamma\), whereas in the homogeneous dispersion Tweedie’s CP approach the mean estimates \({\widehat{\mu }}_i\) rely on the specific choice of the power variance parameter \(p=(\gamma +2)/(\gamma +1)\) through the working weight matrix \(W_p\), see (3.12).

  4. In general, the dispersions resulting from \(\mathrm{CPG}(\lambda _iw_i,\gamma ,c_i)\) are not constant:

    $$\begin{aligned} \mathrm{Var}(S_i/w_i)= & w_i^{-2}{{\mathbb {E}}}[N_i]{{\mathbb {E}}}[Z_{i,1}^2]~=~w_i^{-1}\lambda _i\left( \frac{\gamma }{c^2_i}+ \frac{\gamma ^2}{c_i^2}\right) ~=~w_i^{-1}\lambda _i\zeta _i~\frac{1+\gamma }{c_i}\\= & w_i^{-1}\mu _i^p~\frac{\mu _i^{1-p}}{c_i (p-1)} =w_i^{-1}\left( \frac{-\theta _i}{c_i}\right) \mu _i^p =\frac{\phi _i}{w_i}~\mu _i^p. \end{aligned}$$

    The dispersion can only be constant if \(\phi _i=-\theta _i/c_i\) does not depend on i. Typically, this is not the case, see also Conclusions and Remarks 3.9, below. Therefore, we need to extend the homogeneous dispersion case of Tweedie’s CP model to a DGLM Tweedie’s CP model, otherwise it cannot be compared to the CPG case, which is more flexible in dispersion modeling. For more analysis of the homogeneous dispersion case see [12].

3.2.2 Heterogeneous dispersion case

As stated in Remarks 3.2, the homogeneous dispersion Tweedie’s CP approach does not use full information of claim counts and claim costs and it does not allow for flexible dispersion modeling \(\phi _i\). In Section 2 of [17], the authors raise the point that in applications of Tweedie’s CP model to insurance claim data it is important to use full information so that the dispersion parameter \(\phi _i\) is also modeled flexibly. As a consequence, the dispersion parameter cannot be factored out as in (3.12), and it does not cancel in optimization (3.11). Therefore, [17] propose to use the framework of DGLMs which was introduced and developed by [8, 15, 18]. DGLMs allow for simultaneous modeling of both mean and dispersion parameters by using a second GLM for the dispersion parameter \(\phi _i\). The two GLMs are jointly calibrated using claim count and claim cost information. The joint density of a single case (N, Y) has been derived in formula (11) of [6]:

$$\begin{aligned} (N,Y)~\sim ~f(n,y;\theta , w/\phi ) = \exp \left\{ \frac{y\theta - \kappa _p(\theta )}{\phi /w} + a(n,y;w/\phi )\right\} , \end{aligned}$$
(3.13)

with \(p=(\gamma +2)/(\gamma +1)\), \(\kappa _p(\cdot )\) given in (2.3), and

$$\begin{aligned} \exp \left\{ a(n,y;w/\phi )\right\} = \left( \frac{(w/\phi )^{\gamma +1} y^\gamma }{(p-1)^\gamma (2-p)}\right) ^n \frac{1}{n!\Gamma (n\gamma ) y}. \end{aligned}$$

If we re-parametrize this joint distribution using mean parameter \(\mu =\kappa _p'(\theta )=((1-p)\theta )^{1/(1-p)}\) for total claim costs we arrive at the log-likelihood function

$$\begin{aligned} \ell (\mu ,\phi ) = \left\{ \begin{array}{ll} \frac{w}{\phi }\left( Y\frac{\mu ^{1-p}}{1-p} - \frac{\mu ^{2-p}}{2-p}\right) +N\log \left( \frac{(w/\phi )^{\gamma +1} Y^\gamma }{(p-1)^\gamma (2-p)}\right) - \log \left( N!\Gamma (N\gamma ) Y\right) &\quad \text { for }N>0,\\ - \frac{w}{\phi } \frac{\mu ^{2-p}}{2-p} & \quad \text { for } N=0. \end{array} \right. \end{aligned}$$

In complete analogy with the above we determine the score equations w.r.t. \(\mu\) and \(\phi\)

$$\begin{aligned} \frac{\partial }{\partial \mu } \ell (\mu ,\phi )=0\Leftrightarrow & \frac{w}{\phi }\frac{1}{V(\mu )}\left( Y- \mu \right) =0, \end{aligned}$$
(3.14)
$$\begin{aligned} \frac{\partial }{\partial \phi } \ell (\mu ,\phi )=0\Leftrightarrow & - \frac{w}{\phi ^2}\left( Y\frac{\mu ^{1-p}}{1-p} - \frac{\mu ^{2-p}}{2-p}\right) -\frac{1}{ \phi }\frac{N}{p-1}=0, \end{aligned}$$
(3.15)

with variance function \(V(\mu )=\mu ^p\).
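
For later use, the joint log-likelihood \(\ell (\mu ,\phi )\) displayed above is easily coded; a minimal R sketch (the function name is ours, and all arguments except p are vectors over the policies):

```r
# Minimal R sketch of the joint log-likelihood l(mu, phi) of (N, Y) given above;
# mu, phi, N, Y, w are policy vectors, p is a scalar.
loglik_cp <- function(mu, phi, p, N, Y, w) {
  gam <- (2 - p) / (p - 1)
  ll  <- sum(-(w / phi) * mu^(2 - p) / (2 - p))  # term common to all policies
  i   <- which(N > 0)                            # remaining terms for N > 0 only
  ll + sum((w[i] / phi[i]) * Y[i] * mu[i]^(1 - p) / (1 - p) +
           N[i] * log((w[i] / phi[i])^(gam + 1) * Y[i]^gam / ((p - 1)^gam * (2 - p))) -
           lfactorial(N[i]) - lgamma(N[i] * gam) - log(Y[i]))
}
```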

Proposition 3.3

Fisher’s information contribution in the heterogeneous dispersion Tweedie’s CP model w.r.t. \((\mu , \phi )\) is given by

$$\begin{aligned} {{\mathcal {I}}}(\mu , \phi ) = -{{\mathbb {E}}}\left[ \nabla ^2_{(\mu ,\phi )} \ell (\mu ,\phi ) \right] = \left( \begin{array}{cc} \frac{w}{\phi }\frac{1}{V(\mu )} & 0\\ 0& \frac{w \mu ^{2-p}}{(p-1)(2-p)}\frac{1}{\phi ^3} \end{array} \right) . \end{aligned}$$
(3.16)

Moreover, we have

$$\begin{aligned} {{\mathbb {E}}}\left[ \frac{\partial ^2}{\partial \mu \partial p} \ell (\mu ,\phi ) \right] =0. \end{aligned}$$

Note that in the above proposition we talk about Fisher’s information contribution because the statement considers only one single random variable (N, Y). This is in contrast to (3.27) where we calculate Fisher’s information matrix over the entire portfolio.

Joint MLE of \(\mu\) and \(\phi\) requires solving score Eqs. (3.14)-(3.15). This can be done by any suitable root search or gradient descent algorithm. In [17], this root search problem is approached using a slightly different representation, namely, by introducing a dispersion response variable D. This allows for a reformulation of the model in a DGLM form. We revisit [17] after proving Proposition 3.3.

Proof of Proposition 3.3

We start by calculating the means of the terms of the score in (3.15). We have

$$\begin{aligned} {{\mathbb {E}}}\left[ Y\frac{\mu ^{1-p}}{1-p} - \frac{\mu ^{2-p}}{2-p}\right] =\frac{\mu ^{2-p}}{1-p} - \frac{\mu ^{2-p}}{2-p}=\frac{1}{1-p}\frac{\mu ^{2-p}}{2-p} =\frac{1}{1-p}\kappa _p(\theta ), \end{aligned}$$

and for the second term we obtain

$$\begin{aligned} {{\mathbb {E}}}\left[ \frac{1}{ \phi }\frac{N}{p-1}\right]= & \frac{1}{\phi ^2}\frac{\phi \lambda w}{p-1} =\frac{1}{\phi ^2}\frac{-\theta }{c}\frac{\lambda w}{p-1} =\frac{1}{\phi ^2}\frac{\mu ^{1-p}}{2-p}\frac{\gamma }{c}\frac{\lambda w}{p-1} \\= & \frac{w}{\phi ^2}\frac{1}{p-1}\frac{\mu ^{2-p}}{2-p} =-\frac{w}{\phi ^2}\frac{1}{1-p}\kappa _p(\theta ). \end{aligned}$$

From these two formulas it follows that, indeed, the score in (3.15) is a residual with mean zero. The cross-covariance terms are easily obtained by noting that the score in (3.14) is also a zero mean residual. This implies

$$\begin{aligned} -{{\mathbb {E}}}\left[ \frac{\partial ^2}{\partial \mu \partial \phi } \ell (\mu ,\phi ) \right] = -{{\mathbb {E}}}\left[ \frac{\partial ^2}{\partial \mu \partial p} \ell (\mu ,\phi ) \right] =0. \end{aligned}$$
(3.17)

There remain the diagonal terms. For the first one we have, using integration by parts,

$$\begin{aligned} -{{\mathbb {E}}}\left[ \frac{\partial ^2}{\partial \mu ^2} \ell (\mu ,\phi )\right] = {{\mathbb {E}}}\left[ \left( \frac{\partial }{\partial \mu } \ell (\mu ,\phi )\right) ^2\right] =\frac{w^2}{\phi ^2}\frac{1}{V(\mu )^2}\mathrm{Var}\left( Y\right) =\frac{w}{\phi }\frac{1}{V(\mu )}. \end{aligned}$$

For the second diagonal term, which equals the variance of the zero mean score in (3.15), we have

$$\begin{aligned} -{{\mathbb {E}}}\left[ \frac{\partial ^2}{\partial \phi ^2} \ell (\mu ,\phi )\right] = -{{\mathbb {E}}}\left[ 2 \frac{w}{\phi ^3}\left( Y\frac{\mu ^{1-p}}{1-p} - \frac{\mu ^{2-p}}{2-p}\right) +\frac{N(\gamma +1)}{ \phi ^2} \right] = \frac{w}{\phi ^3}\frac{\mu ^{2-p}}{(p-1)(2-p)}. \end{aligned}$$

This finishes the proof of Proposition 3.3. \(\square\)

Thus, for MLE of \(\mu\) and \(\phi\) we need to consider the scores in (3.14)–(3.15), the latter one defining (unscaled) residuals w.r.t. the dispersion given by

$$\begin{aligned} {{\mathcal {E}}}_d= \frac{\partial }{\partial \phi } \ell (\mu ,\phi ) =\frac{1}{\phi ^2}\left[ -w\left( Y\frac{\mu ^{1-p}}{1-p} -\frac{\mu ^{2-p}}{2-p}\right) -~\phi \frac{N}{p-1}\right] . \end{aligned}$$

As mentioned above, solving the score Eqs. (3.14)–(3.15) produces the MLEs for \(\mu\) and \(\phi\); basically, this finishes the MLE problem. In the remainder of this section, following [17], we rewrite this MLE problem. This different representation introduces a new (dispersion) response variable D, such that the root search problem can be related directly to Fisher’s scoring method in a DGLM form. Choose the square variance function \(V_d(\phi )=\phi ^2\) and dispersion-prior weights

$$\begin{aligned} v = \frac{2w}{\phi } \frac{\mu ^{2-p}}{(p-1)(2-p)}~>~0. \end{aligned}$$
(3.18)

This allows us to define so-called dispersion responses

$$\begin{aligned} D = \frac{2}{v} \left( -w \left( Y\frac{\mu ^{1-p}}{1-p} - \frac{\mu ^{2-p}}{2-p}\right) -\phi ~ \frac{N}{p-1} \right) +\phi = \frac{2}{v} V_d(\phi ){{\mathcal {E}}}_d +\phi , \end{aligned}$$
(3.19)

having \({{\mathbb {E}}}[D]=\phi\), \(\mathrm{Var}(D)=\frac{2}{v}V_d(\phi )\) and scores w.r.t. \(\phi\)

$$\begin{aligned} \frac{\partial }{\partial \phi } \ell (\mu ,\phi ) =\frac{v}{2} \frac{1}{V_d(\phi )}\left( D-\phi \right) . \end{aligned}$$
(3.20)

Fisher’s information contribution (3.16) then reads as

$$\begin{aligned} {{\mathcal {I}}}(\mu , \phi ) = -{{\mathbb {E}}}\left[ \nabla ^2_{(\mu ,\phi )} \ell (\mu ,\phi ) \right] = \left( \begin{array}{cc} \frac{w}{\phi }\frac{1}{V(\mu )} & 0\\ 0& \frac{v}{2}\frac{1}{V_d(\phi )} \end{array} \right) . \end{aligned}$$

As emphasized by [16], orthogonality of \(\mu\) and \((\phi ,p)\), see (3.17), typically leads to fast convergence in estimation algorithms.

Remarks 3.4

  • We start from the joint distribution of (N, Y), given in (3.13), for estimating \((\mu , \phi )\). This estimation problem is modified by considering a new response vector (Y, D), instead. The new dispersion response D, defined in (3.19), is not gamma distributed, but in view of score (3.20) we bring it into a gamma EDF structure with weight \(v>0\), dispersion parameter 2 and square variance function \(V_d(\phi )=\phi ^2\), see also (2.5). In [17] it is mentioned that these definitions of v and D are somewhat artificial, but they bring this estimation problem into a DGLM form; note that this requires including one dispersion term \(\phi\) in the weight v and the response D, which means that we have an approximate score equation equivalence with a gamma MLE problem. In view of Proposition 3.3, we could also define the dispersion response D differently by choosing an inverse Gaussian power variance function, i.e. \(V_d(\phi )=\phi ^3\), and defining the dispersion-prior weight correspondingly. This provides the same numerical solution for MLE, using an approximate score equation equivalence with an inverse Gaussian MLE problem. However, in this latter version the weights do not provide the right scaling for a distribution within the EDF.

  • Alternatively, we could try to estimate dispersion \(\phi\) using Tweedie’s deviance residuals

    $$\begin{aligned} {{\mathcal {E}}} = \mathrm{sgn}(Y-\mu ) \sqrt{2w \left( Y \frac{Y^{1-p}-\mu ^{1-p}}{1-p}-\frac{Y^{2-p}-\mu ^{2-p}}{2-p}\right) }. \end{aligned}$$

    Following [17], the squared residuals \({{\mathcal {E}}}^2\) are approximately \(\phi \chi _1^2\) distributed for \(\phi\) sufficiently small, thus, they can be approximated by a gamma distribution with mean \(\phi\) and variance \(2\phi ^2\). Section 3.1 of [17] discusses this estimation approach. We do not follow these lines further because this approach does not use any claim count information and, therefore, does not benefit from the full information (N, Y) as the CPG case does.

  • There is a third alternative for dispersion estimation, namely, the one implemented in the R package dglm; a usage sketch is given below. This requires that the dispersion parameter is made policy dependent, and then a DGLM is explored on \((Y,{{\mathcal {E}}})\) by alternating the corresponding score updates. This approach also does not benefit from the full information (N, Y) (in contrast to the CPG model), and it is therefore not further explored in this manuscript.
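
For completeness, a minimal usage sketch of this third alternative; the call follows the documentation pattern of the dglm package (we have not verified tuning details), and the data frame dat is assumed as before:

```r
# Minimal usage sketch of the R package 'dglm' (third alternative above).
library(dglm); library(statmod)
fit_dglm <- dglm(Y ~ x1 + x2,              # mean model, cf. (3.21)
                 dformula = ~ z1 + z2,     # dispersion model, cf. (3.22)
                 family = tweedie(var.power = 1.5, link.power = 0),
                 dlink = "log", weights = w, data = dat)
```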

3.2.3 Double generalized linear model in the heterogeneous Tweedie case

We use the heterogeneous dispersion Tweedie’s CP approach and bring it into a DGLM form as described in the previous section. Choosing a suitable link function \(g_p(\cdot )\) we make the following regression assumption for the linear predictor of the mean

$$\begin{aligned} g_p(\mu _i)= g_p({{\mathbb {E}}}[ Y_i]) =g_p(\kappa _p'(\theta _i))= \eta _i = \left\langle \varvec{\beta }^*,{\varvec{x}}^*_i \right\rangle , \end{aligned}$$
(3.21)

where the upper indices \(^*\) distinguish the parametrization of Tweedie’s CP GLM from the individual models in Sect. 3.1. For the modeling of the dispersion parameter we choose a second link function \(g_d(\cdot )\) such that we have the linear predictor

$$\begin{aligned} g_d(\phi _i) = \left\langle \varvec{\alpha }^*, {\varvec{z}}^*_i \right\rangle , \end{aligned}$$
(3.22)

where the covariates \({\varvec{z}}^*_i \in {{\mathcal {Z}}}^*\subset \{1\}\times {{\mathbb {R}}}^{q^*}\) are potentially differently pre-processed than the ones \({\varvec{x}}^*_i \in {{\mathcal {X}}}^*\subset \{1\}\times {{\mathbb {R}}}^{d^*}\), but still belong to the same policy i. MLE of \((\varvec{\beta }^*,\varvec{\alpha }^*)\) requires solving the score equations, see (3.14) and (3.20),

$$\begin{aligned} \nabla _{\varvec{\beta }^*} \ell (\varvec{\beta }^*,\varvec{\alpha }^*) ={\varvec{0}}\Leftrightarrow & \sum _{i=1}^n \frac{w_i}{\phi _i}\frac{1}{V(\mu _i)}\left( Y_i- \mu _i\right) \nabla _{\varvec{\beta }^*} \mu _i ={\mathfrak {X}}'W_p {\varvec{R}}={\varvec{0}},\end{aligned}$$
(3.23)
$$\begin{aligned} \nabla _{\varvec{\alpha }^*} \ell (\varvec{\beta }^*,\varvec{\alpha }^*)={\varvec{0}}\Leftrightarrow & \sum _{i=1}^n \frac{v_{i}}{2} \frac{1}{V_d(\phi _i)}\left( D_i-\phi _i\right) \nabla _{\varvec{\alpha }^*} \phi _i ={\mathfrak {Z}}'W_d {\varvec{R}}_d={\varvec{0}}, \end{aligned}$$
(3.24)

with design matrices \({\mathfrak {X}}=({\varvec{x}}^*_1,\ldots , {\varvec{x}}^*_n)'\) and \({\mathfrak {Z}}=({\varvec{z}}^*_1,\ldots , {\varvec{z}}^*_n)'\), working weight matrices

$$\begin{aligned} W_p= & \mathrm{diag} \left( \left( \frac{\partial g_p(\mu _i)}{\partial \mu _i}\right) ^{-2}\frac{w_i}{\phi _i}\frac{1}{V(\mu _i)} \right) _{i=1,\ldots , n},\\ W_d= & \mathrm{diag} \left( \left( \frac{\partial g_d(\phi _i)}{\partial \phi _i}\right) ^{-2}\frac{v_i}{2}\frac{1}{V_d(\phi _i)} \right) _{i=1,\ldots , n}, \end{aligned}$$

and working residual vectors \({\varvec{R}}= (\frac{\partial g_p(\mu _i)}{\partial \mu _i}(Y_i-\mu _i))_{i=1,\ldots , n}\) and \({\varvec{R}}_d= (\frac{\partial g_d(\phi _i)}{\partial \phi _i}(D_i-\phi _i))_{i=1,\ldots , n}\). For the definition of the dispersion-prior weights \(v_i=v_i(\phi _i)\) and the dispersion responses \(D_i\) we refer to (3.18)–(3.19). Using Fisher’s scoring method for estimating \(\varvec{\beta }^*\) and \(\varvec{\alpha }^*\), see “Appendix A”, we explore the scoring updates

$$\begin{aligned} \varvec{\beta }^*_t\mapsto & \varvec{\beta }^*_{t+1}=\left( {\mathfrak {X}}'W_p{\mathfrak {X}}\right) ^{-1} {\mathfrak {X}}' W_p \left( {\varvec{R}}+ g_p(\varvec{\mu })\right) , \end{aligned}$$
(3.25)
$$\begin{aligned} \varvec{\alpha }^*_t\mapsto & \varvec{\alpha }^*_{t+1}=\left( {\mathfrak {Z}}'W_d{\mathfrak {Z}}\right) ^{-1} {\mathfrak {Z}}' W_d \left( {\varvec{R}}_d + g_d(\varvec{\phi })\right) , \end{aligned}$$
(3.26)

where all terms on the right-hand side are evaluated at algorithmic time t, that is, \(W_p=W_p(\varvec{\beta }^*_t, \varvec{\alpha }^*_t)\), \(W_d=W_d(\varvec{\beta }^*_t, \varvec{\alpha }^*_t)\), \({\varvec{R}}={\varvec{R}}(\varvec{\beta }^*_t)\), \({\varvec{R}}_d={\varvec{R}}_d(\varvec{\beta }^*_t, \varvec{\alpha }^*_t)\), \(g_p(\varvec{\mu })=g_p(\varvec{\mu }(\varvec{\beta }^*_t))\) and \(g_d(\varvec{\phi })=g_d(\varvec{\phi }(\varvec{\alpha }^*_t))\). This also indicates how the two sets of parameters interact. Since the parameters \(\varvec{\beta }^*\) and \(\varvec{\alpha }^*\) are orthogonal, alternating the updates leads to fast convergence. Standard errors are obtained from the inverse of Fisher’s information matrix

$$\begin{aligned} {{\mathcal {I}}}(\varvec{\beta }^*, \varvec{\alpha }^*) = \left( \begin{array}{cc} {\mathfrak {X}}'W_p{\mathfrak {X}} & 0\\ 0& {\mathfrak {Z}}'W_d{\mathfrak {Z}} \end{array} \right) . \end{aligned}$$
(3.27)

It remains to estimate p. This is usually done by considering the profile log-likelihood for p, given optimal estimates of \((\varvec{\beta }^*, \varvec{\alpha }^*)\), that is, we study \(p\mapsto \ell (\widehat{\varvec{\beta }}^*(p), \widehat{\varvec{\alpha }}^*(p),p)\) where, in general, the MLEs \(\widehat{\varvec{\beta }}^*(p)\) and \(\widehat{\varvec{\alpha }}^*(p)\) depend on the explicit choice of the power variance parameter p; for an example of a profile log-likelihood we refer to Fig. 1, below.
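
A minimal R sketch of these alternating scoring updates under log-links for \(g_p\) and \(g_d\) (the function name is ours; X and Z denote the design matrices including intercept columns, and a fixed number of iterations replaces a proper convergence criterion):

```r
# Minimal R sketch of the alternating Fisher scoring updates (3.25)-(3.26)
# under log-links for g_p and g_d; X, Z, Y, N, w and p are assumed given.
dglm_tweedie <- function(X, Z, Y, N, w, p, n_iter = 50) {
  beta  <- c(log(weighted.mean(Y, w)), rep(0, ncol(X) - 1))  # crude initialization
  alpha <- rep(0, ncol(Z))
  for (t in 1:n_iter) {
    mu  <- exp(as.vector(X %*% beta))
    phi <- exp(as.vector(Z %*% alpha))
    # mean update (3.25): W_p = diag(w_i mu_i^(2-p) / phi_i), R_i = Y_i / mu_i - 1
    Wp   <- w * mu^(2 - p) / phi
    beta <- solve(t(X) %*% (Wp * X), t(X) %*% (Wp * (Y / mu - 1 + log(mu))))
    mu   <- exp(as.vector(X %*% beta))
    # dispersion update (3.26): weights v from (3.18), responses D from (3.19)
    v     <- 2 * w * mu^(2 - p) / (phi * (p - 1) * (2 - p))
    D     <- (2 / v) * (-w * (Y * mu^(1 - p) / (1 - p) - mu^(2 - p) / (2 - p)) -
                        phi * N / (p - 1)) + phi
    Wd    <- v / 2                                           # W_d under log-link
    alpha <- solve(t(Z) %*% (Wd * Z), t(Z) %*% (Wd * (D / phi - 1 + log(phi))))
  }
  list(beta = as.vector(beta), alpha = as.vector(alpha))
}
```

Due to the orthogonality of \(\varvec{\beta }^*\) and \(\varvec{\alpha }^*\), a few alternations typically suffice; in practice one would add a stopping criterion based on the change of the log-likelihood.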

Remarks 3.5

  • We emphasize that covariates may be chosen and pre-processed differently in the CPG and in Tweedie’s CP models; this is indicated by choosing different notation for the covariate spaces \(({{\mathcal {X}}}, {{\mathcal {Z}}})\) and \(({{\mathcal {X}}}^*, {{\mathcal {Z}}}^*)\), respectively. Different pre-processing of covariates might be necessary because we aim at optimally modeling different responses in the two models. This optimal modeling also includes good choices of link functions which may even imply that a CPG GLM does not lead to a Tweedie CP DGLM counterpart (or vice versa) because the linear predictor structure does not necessarily carry through general choices of link functions. In Sect. 3.3 we fully rely on log-links which allow for a one-to-one identification scheme between the different GLM frameworks.

  • The calculation of the terms of Fisher’s information matrix involving p is a bit cumbersome; for this reason we do not give them explicitly.

  • As usual in MLE, typically, the dispersion parameters \(\phi _i\) will be under-estimated because MLE is not unbiased for variance parameter estimation, we refer to [17], Sects. 3.2 and 4.3. Using both total claim costs Y and claim counts N, the bias is often small, see [17].

We close this subsection by considering the special case of log-links for \(g_p\) and \(g_d\). This special choice provides working weight matrices \(W_p\) and \(W_d\)

$$\begin{aligned} W_p = \mathrm{diag} \left( \frac{w_i}{\phi _i}\mu _i^{2-p} \right) _{i=1,\ldots , n}=(p-1)(2-p)~ \mathrm{diag}\left( \frac{v_i}{2} \right) _{i=1,\ldots , n} =(p-1)(2-p) W_d, \end{aligned}$$

and working residual vectors \({\varvec{R}}= ((Y_i/\mu _i-1))_{i=1,\ldots , n}\) and \({\varvec{R}}_d= ((D_i/\phi _i-1))_{i=1,\ldots , n}\). This provides us with score equations

$$\begin{aligned} \nabla _{\varvec{\beta }^*} \ell (\varvec{\beta }^*,\varvec{\alpha }^*)={\varvec{0}}\Leftrightarrow & {\mathfrak {X}}'W_p {\varvec{R}}={\varvec{0}}, \end{aligned}$$
(3.28)
$$\begin{aligned} \nabla _{\varvec{\alpha }^*} \ell (\varvec{\beta }^*,\varvec{\alpha }^*)={\varvec{0}}\Leftrightarrow & {\mathfrak {Z}}'W_p {\varvec{R}}_d={\varvec{0}}, \end{aligned}$$
(3.29)

thus, in both cases we can use the same working weight matrix \(W_p\).

Theorem 3.6

Assume Tweedie’s CP DGLM holds with covariate spaces \({{\mathcal {X}}}^*={{\mathcal {Z}}}^*\) and covariate choices \({\varvec{x}}^*_i={\varvec{z}}^*_i\) for all insurance policies \(i=1,\ldots , n\). Moreover, assume that for both GLMs we choose log-links for \(g_p\) and \(g_d\). The MLE \(\widehat{\varvec{\beta }}^*\) of \(\varvec{\beta }^*\) does not depend on the explicit choice of the power variance parameter \(p\in (1,2)\), and also the corresponding mean estimates \({\widehat{\mu }}_i=\exp \langle \widehat{\varvec{\beta }}^*,{\varvec{x}}^*_i\rangle\) are p-independent. Assume that \({\widehat{\mu }}_i\) and \({\widehat{\phi }}_i(p)\) solve the score Eqs. (3.28)–(3.29) for power variance parameter \(p \in (1,2)\). The dispersion parameter estimates then scale as a function of the power variance parameter \(q \in (1,2)\) as

$$\begin{aligned} {\widehat{\phi }}_i(q) ~= ~\frac{2-p}{2-q}~{\widehat{\phi }}_i(p)~ {\widehat{\mu }}_i^{p-q} \qquad \text { for all insurance policies } i. \end{aligned}$$

Remarks 3.7

  • Theorem 3.6 is a very useful and strong result. In general, we have to run Fisher’s scoring method for every power variance parameter \(p\in (1,2)\) to find the optimal MLEs \(\widehat{\varvec{\beta }}^*(p)\) and \(\widehat{\varvec{\alpha }}^*(p)\). In a second step, the optimal power variance parameter is found by considering the profile log-likelihood in p. Under the assumptions of Theorem 3.6 we only need to run Fisher’s scoring method once to obtain the MLEs \(\widehat{\varvec{\beta }}^*\) and \(\widehat{\varvec{\alpha }}^*(p)\) for a fixed power variance parameter p. All dispersion estimates for different power variance parameters are then directly obtained from Theorem 3.6, and the mean parameter estimates do not vary in p. That is, we can directly maximize the function \(q \mapsto \ell ({\widehat{\mu }}_i, {\widehat{\phi }}_i(q),q)\) where the dispersion \({\widehat{\phi }}_i(q)\) scales in q according to Theorem 3.6; a sketch of this procedure is given after the proof below.

  • Theorem 3.6 also highlights that the heterogeneous dispersion case is fundamentally different from the homogeneous one. The mean estimates in the homogeneous case depend on the choice of the power variance parameter p through the working weight matrix \(W_p\) in (3.12); a constant dispersion parameter does not leave any room to balance different p’s through dispersions that vary over the portfolio. In contrast, under the assumptions of Theorem 3.6, the mean estimates in the heterogeneous dispersion case are not sensitive to p, in line with the CPG case.

Proof of Theorem 3.6

Under the log-link choices, the score equations for \(\varvec{\beta }^*\) and \(\varvec{\alpha }^*\) read as, see (3.14)–(3.15),

$$\begin{aligned} \nabla _{\varvec{\beta }^*} \ell (\varvec{\beta }^*,\varvec{\alpha }^*)= & \sum _{i=1}^n \frac{w_i}{\phi _i}\mu _i^{1-p}\left( Y_i- \mu _i\right) {\varvec{x}}_i^*~=~{\varvec{0}},\\ \nabla _{\varvec{\alpha }^*} \ell (\varvec{\beta }^*,\varvec{\alpha }^*)= & - \sum _{i=1}^n \frac{w_i}{\phi _i} \left( Y_i\frac{\mu _i^{1-p}}{1-p} - \frac{\mu _i^{2-p}}{2-p}-\frac{\phi _i}{1-p} \frac{N_i}{w_i} \right) {\varvec{x}}_i^*~=~{\varvec{0}}. \end{aligned}$$

Assume that \(\mu _i=\mu _i(p)\) and \(\phi _i =\phi _i(p)\) solve the above score equations for given power variance parameter \(p \in (1,2)\). Next, we choose power variance parameter \(q\ne p\), and define \({\widetilde{\phi }}_i = k \phi _i \mu _i^{p-q}\) for some \(k>0\). We plug \(\mu _i\) and \({\widetilde{\phi }}_i\) into the first score equation for power variance parameter q

$$\begin{aligned} \sum _{i=1}^n \frac{w_i}{{\widetilde{\phi }}_i}\mu _i^{1-q}\left( Y_i- \mu _i\right) {\varvec{x}}_i^*=\frac{1}{k}\sum _{i=1}^n \frac{w_i}{{\phi }_i}\mu _i^{1-p}\left( Y_i- \mu _i\right) {\varvec{x}}_i^*= {\varvec{0}}, \end{aligned}$$

thus, the pairs \((\mu _i,{\widetilde{\phi }}_i)\) fulfill the first score equation. We now need to massage these pairs through the second score equation for power variance parameter q

$$\begin{aligned}&- \sum _{i=1}^n \frac{w_i}{{\widetilde{\phi }}_i} \left( Y_i\frac{\mu _i^{1-q}}{1-q} - \frac{\mu _i^{2-q}}{2-q}-\frac{{\widetilde{\phi }}_i}{1-q} \frac{N_i}{w_i} \right) {\varvec{x}}_i^*\\&\quad =\frac{-1}{k} \sum _{i=1}^n \frac{w_i}{\phi _i } \left( Y_i\frac{\mu _i^{1-p}}{1-q} - \frac{\mu _i^{2-p}}{2-q}-\frac{k \phi _i}{1-q} \frac{N_i}{w_i} \right) {\varvec{x}}_i^*\\&\quad = \frac{p-1}{1-q} \sum _{i=1}^n \frac{w_i}{\phi _i } \left( Y_i\frac{\mu _i^{1-p}}{1-p}\frac{1}{k} - \frac{\mu _i^{2-p}}{2-p}\frac{2-p}{2-q}\frac{1-q}{k (1-p)} -\frac{\phi _i}{1-p} \frac{N_i}{w_i} \right) {\varvec{x}}_i^*. \end{aligned}$$

Next we use that the pairs \((\mu _i,\phi _i)\) solve the score equations for p. This provides for the score function of \(\varvec{\alpha }^*\)

$$\begin{aligned}&- \sum _{i=1}^n \frac{w_i}{{\widetilde{\phi }}_i} \left( Y_i\frac{\mu _i^{1-q}}{1-q} - \frac{\mu _i^{2-q}}{2-q}-\frac{{\widetilde{\phi }}_i}{1-q} \frac{N_i}{w_i} \right) {\varvec{x}}_i^*\nonumber \\&\quad = \frac{p-1}{1-q} \sum _{i=1}^n \frac{w_i}{\phi _i } \left( Y_i\frac{\mu _i^{1-p}}{1-p}\left( \frac{1}{k}-1\right) - \frac{\mu _i^{2-p}}{2-p}\left( \frac{2-p}{2-q}\frac{1-q}{k (1-p)}-1\right) \right) {\varvec{x}}_i^* \nonumber \\&\quad =\frac{1-k}{k} \frac{1}{q-1} \sum _{i=1}^n \frac{w_i}{\phi _i }\mu _i^{1-p} \left( Y_i - \mu _i \frac{(2-p)(1-q)-k(1-p)(2-q)}{(2-q)(2-p)(1-k)} \right) {\varvec{x}}_i^*. \end{aligned}$$
(3.30)

Now we still have one parameter \(k>0\) that we can choose. We require

$$\begin{aligned} \frac{(2-p)(1-q)-k(1-p)(2-q)}{(2-q)(2-p)(1-k)}=1 \quad \Leftrightarrow \quad k = \frac{2-p}{2-q}. \end{aligned}$$

This choice implies that (3.30) is equal to zero, which follows from the fact that the pairs \((\mu _i,\phi _i)\) solve the score equations for \(\varvec{\beta }^*=\varvec{\beta }^*(p)\). This finishes the proof. In Remarks 3.10 we give a shorter proof. \(\square\)
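
The resulting calibration short-cut of Theorem 3.6 is illustrated by the following minimal R sketch, reusing dglm_tweedie() and loglik_cp() from the sketches above (under the assumptions of Theorem 3.6 we have Z = X):

```r
# Minimal R sketch: fit once at a reference p0, then profile the log-likelihood
# over q using the dispersion rescaling of Theorem 3.6.
p0  <- 1.5
fit <- dglm_tweedie(X, Z, Y, N, w, p = p0)
mu_hat  <- exp(as.vector(X %*% fit$beta))
phi_hat <- exp(as.vector(Z %*% fit$alpha))
profile_q <- function(q) {
  phi_q <- (2 - p0) / (2 - q) * phi_hat * mu_hat^(p0 - q)   # Theorem 3.6
  loglik_cp(mu_hat, phi_q, q, N, Y, w)
}
qs    <- seq(1.05, 1.95, by = 0.01)
q_hat <- qs[which.max(sapply(qs, profile_q))]               # profile MLE of p
```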

3.3 Relation between the two GLM approaches

We compare the CPG model to its counterpart parametrized through Tweedie’s CP model. To start off, recall formulas (2.6)–(2.9). The first formula gives the relationship \(p=(\gamma +2)/(\gamma +1)\in (1,2)\). Since these two parameters are not modeled as insurance policy dependent, we identify them directly. We start with the gamma claim size GLM of Sect. 3.1 using identification (2.7). The means are given by, see (3.5),

$$\begin{aligned} g_2^{-1} \left\langle \varvec{\alpha },{\varvec{z}}\right\rangle ~=~ \zeta ~=~\frac{\gamma }{c}~{\mathop =\limits ^{(2.7)}}~\frac{\gamma \phi }{-\theta } ~=~\frac{(2-p)\phi }{\mu ^{1-p}} ~=~(2-p)\frac{g_d^{-1}\left\langle \varvec{\alpha }^*, {\varvec{z}}^*\right\rangle }{\left( g_p^{-1}\left\langle \varvec{\beta }^*,{\varvec{x}}^*\right\rangle \right) ^{1-p}}, \end{aligned}$$
(3.31)

where we have used canonical link \(\theta =(\kappa _p')^{-1}(\mu )=-\mu ^{1-p}/(p-1)\). From identification (2.9) we have

$$\begin{aligned} \exp \left\langle \varvec{\beta },{\varvec{x}}\right\rangle ~=~\lambda ~{\mathop =\limits ^{(2.9)}}~ \frac{1}{\phi }\kappa _p(\theta ) ~=~(2-p)^{-1} \frac{\left( g_p^{-1}\left\langle \varvec{\beta }^*,{\varvec{x}}^*\right\rangle \right) ^{2-p}}{g_d^{-1}\left\langle \varvec{\alpha }^*, {\varvec{z}}^*\right\rangle }. \end{aligned}$$
(3.32)

From identities (3.31)–(3.32) we conclude that for general link functions it is non-trivial to derive one parametrization from the other, i.e. this requires quite some feature engineering to bring the models in line (if possible at all). If we choose log-links for \(g_2\), \(g_p\) and \(g_d\) (these are not the canonical links in all three cases but they are convenient because they preserve the right sign convention on the canonical scale) we can directly compare the linear predictors

$$\begin{aligned} \left\langle \varvec{\beta },{\varvec{x}}\right\rangle= & -\log (2-p)-\left\langle \varvec{\alpha }^*, {\varvec{z}}^*\right\rangle +(2-p)\left\langle \varvec{\beta }^*,{\varvec{x}}^*\right\rangle ,\\ \left\langle \varvec{\alpha },{\varvec{z}}\right\rangle= & \log (2-p)+\left\langle \varvec{\alpha }^*, {\varvec{z}}^*\right\rangle -(1-p)\left\langle \varvec{\beta }^*,{\varvec{x}}^*\right\rangle . \end{aligned}$$

Formulating this differently gives us the following theorem.

Theorem 3.8

Assume all link functions in (3.2), (3.5), (3.21) and (3.22) are chosen to be the log-links. The CPG GLM having constant shape parameter \(\gamma >0\) and Tweedie’s CP DGLM with variance parameter \(p=(\gamma +2)/(\gamma +1)\in (1,2)\) can be identified by (i.e. the resulting two models are equal under) the following equations for the linear predictors

$$\begin{aligned} \left\langle \varvec{\beta }^*,{\varvec{x}}^*\right\rangle= & \left\langle \varvec{\beta },{\varvec{x}}\right\rangle + \left\langle \varvec{\alpha },{\varvec{z}}\right\rangle ,\\ \left\langle \varvec{\alpha }^*, {\varvec{z}}^*\right\rangle= & -\log (2-p) -(p-1)\left\langle \varvec{\beta },{\varvec{x}}\right\rangle + (2-p)\left\langle \varvec{\alpha },{\varvec{z}}\right\rangle . \end{aligned}$$
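As a small illustration of these identities, the following R sketch maps estimated CPG coefficients to Tweedie CP DGLM coefficients; it assumes that \(\varvec{\beta }\) and \(\varvec{\alpha }\) have been padded with zeros to the common covariate basis \({\varvec{x}}^*={\varvec{z}}^*={\varvec{x}}\cup {\varvec{z}}\), with the intercept in the first position (this padding convention is ours, not part of the theorem):

# Map CPG coefficients (beta, alpha) to Tweedie CP DGLM coefficients via
# Theorem 3.8; both vectors are assumed padded to the common basis x* = z*,
# with the intercept as first component (our convention for this sketch).
cpg_to_tweedie <- function(beta, alpha, gamma) {
  p <- (gamma + 2) / (gamma + 1)                 # power variance parameter, see (2.6)
  beta_star  <- beta + alpha                     # first identity of Theorem 3.8
  alpha_star <- -(p - 1) * beta + (2 - p) * alpha
  alpha_star[1] <- alpha_star[1] - log(2 - p)    # -log(2-p) shifts the intercept
  list(beta_star = beta_star, alpha_star = alpha_star, p = p)
}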

Conclusions and Remarks 3.9

  • If we have found a good parametrization for the Poisson claim counts GLM and the gamma claim size GLM involving covariates \({\varvec{x}}\in {{\mathcal {X}}}\) and \({\varvec{z}}\in {{\mathcal {Z}}}\), then Tweedie’s CP model should include all components present in \({\varvec{x}}\cup {\varvec{z}}\), and \({\varvec{x}}^*\) and \({\varvec{z}}^*\) should only differ if some components of \({\varvec{x}}\cup {\varvec{z}}\) cancel out by a particular choice of regression parameters \(\varvec{\beta }\) and \(\varvec{\alpha }\). The same holds true if we exchange the roles of the two models.

  • From the second identity of Theorem 3.8 we see that dispersion \(\phi _i\) is constant over all policies i if and only if

    $$\begin{aligned} (p-1)\left\langle \varvec{\beta }^{(-0)},{\varvec{x}}_i^{(-0)}\right\rangle = (2-p)\left\langle \varvec{\alpha }^{(-0)},{\varvec{z}}_i^{(-0)}\right\rangle \qquad \text {for all } i, \end{aligned}$$
    (3.33)

    where the upper indices \(^{(-0)}\) indicate that we exclude the intercept components \(x_{i,0}=z_{i,0}=1\) from these scalar products. Identity (3.33) gives the condition under which the assumptions of Sect. 3.2.1 are justified. However, in many practical insurance pricing examples, we find that the covariate space \({{\mathcal {Z}}}\) for claim sizes is strictly smaller than the space \({{\mathcal {X}}}\) used for claim counts modeling because certain factors only influence claim frequencies but are not significant for claim severities. In addition, there are often covariates that have opposite signs for claim counts and claim sizes. In all these cases (3.33) is not satisfied, and working under a constant dispersion assumption cannot be justified.

  • We believe that covariate pre-processing is more easily done within the CPG model. The reason is, as stated above, that claim counts and claim sizes often behave differently w.r.t. covariate information. The covariate spaces \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) allow us to explore such differences individually. In Tweedie’s CP model everything is merged together, which makes it more difficult to choose good covariates and to separate the different systematic effects.

  • Tweedie’s CP model calibrated with MLE will typically differ from the corresponding CPG model if we follow Theorem 3.8. The CPG model involves \(|{{\mathcal {X}}}|+|{{\mathcal {Z}}}|=d+q+2\) parameters. This typically results in a Tweedie CP model with \(|{{\mathcal {X}}}^*|+|{{\mathcal {Z}}}^*|=2|{{\mathcal {X}}}^*|\) parameters, which is larger than \(d+q+2\) if \({{\mathcal {X}}}\ne {{\mathcal {Z}}}\). For instance, if \({\varvec{x}}\) has \(d=10\) and \({\varvec{z}}\) has \(q=5\) non-intercept components, 3 of which are shared, then the CPG model has 17 parameters, while \({\varvec{x}}\cup {\varvec{z}}\) has 12 non-intercept components and Tweedie’s CP model has \(2\cdot 13=26\) parameters. Thus, in Tweedie’s CP model there are more parameters to be estimated if we follow the above guidance.

We close this section by giving the log-likelihoods of Tweedie’s CP DGLM and of the CPG GLM under log-link choices. The log-likelihood of Tweedie’s CP DGLM is given by

$$\begin{aligned} \ell _\mathrm{Tw}(\varvec{\beta }^*, \varvec{\alpha }^*)= & \sum _{i=1}^n w_i \left( Y_i\frac{ e^{(1-p)\left\langle \varvec{\beta }^*,{\varvec{x}}^*_i\right\rangle -\left\langle \varvec{\alpha }^*,{\varvec{z}}^*_i\right\rangle }}{1-p} - \frac{e^{(2-p)\left\langle \varvec{\beta }^*,{\varvec{x}}^*_i\right\rangle -\left\langle \varvec{\alpha }^*,{\varvec{z}}^*_i\right\rangle }}{2-p}\right) - \frac{N_i}{p-1} \left\langle \varvec{\alpha }^*,{\varvec{z}}^*_i\right\rangle \nonumber \\&+ ~N_i \log \left( \frac{w_i^{\gamma +1} Y_i^\gamma }{(p-1)^\gamma (2-p)}\right) - \log \left( N_i!\Gamma (N_i\gamma ) Y_i\right) . \end{aligned}$$
(3.34)

To make the log-likelihood of the CPG GLM directly comparable to (3.34), we make a change of variables \((N_i,{\bar{Z}}_i)\mapsto (N_i, Y_i)\) by setting \(Y_i= N_i {\bar{Z}}_i /w_i\). This gives us the log-likelihood

$$\begin{aligned} \ell _\mathrm{CPG}(\varvec{\beta }, \varvec{\alpha })= & \sum _{i=1}^n w_i \left( Y_i \frac{e^{\log (2-p)-\left\langle \varvec{\alpha },{\varvec{z}}_i\right\rangle }}{1-p} - \frac{e^{\log (2-p)+\left\langle \varvec{\beta },{\varvec{x}}_i\right\rangle }}{2-p} \right) - \frac{N_i}{p-1} (2-p)\left\langle \varvec{\alpha },{\varvec{z}}_i\right\rangle \nonumber \\&+~ \frac{N_i}{p-1} (p-1)\left\langle \varvec{\beta },{\varvec{x}}_i\right\rangle + N_i \log \left( w_i^{\gamma +1}\gamma ^\gamma Y_i^\gamma \right) - \log \left( N_i!\Gamma (\gamma N_i)Y_i\right) . \end{aligned}$$
(3.35)

Assuming the covariate relationship \({\varvec{x}}^*_i={\varvec{z}}^*_i\), we can re-parametrize the first log-likelihood (3.34) by setting \(\varvec{\beta }^+=(2-p)\varvec{\beta }^*- \varvec{\alpha }^*\) and \(\varvec{\alpha }^+=-(1-p)\varvec{\beta }^*+ \varvec{\alpha }^*\); this gives us (we drop irrelevant terms)

$$\begin{aligned} \ell _\mathrm{Tw}(\varvec{\beta }^+, \varvec{\alpha }^+)\propto & \sum _{i=1}^n w_i \left( Y_i\frac{ e^{-\left\langle \varvec{\alpha }^+,{\varvec{x}}^*_i\right\rangle }}{1-p} - \frac{e^{\left\langle \varvec{\beta }^+,{\varvec{x}}^*_i\right\rangle }}{2-p}\right) - \frac{N_i}{p-1} \left\langle \left[ (2-p) \varvec{\alpha }^+-(p-1) \varvec{\beta }^+\right] ,{\varvec{x}}^*_i\right\rangle . \end{aligned}$$

This proves under \({\varvec{x}}_i^*={\varvec{z}}_i^*={\varvec{x}}_i\cup {\varvec{z}}_i\) that the CPG model is nested in Tweedie’s CP model and we have for given \(p=(\gamma +2)/(\gamma +1)\)

$$\begin{aligned} \underset{(\varvec{\beta }^*, \varvec{\alpha }^*)}{\max }~ \ell _\mathrm{Tw}(\varvec{\beta }^*, \varvec{\alpha }^*) ~\ge ~ \underset{(\varvec{\beta }, \varvec{\alpha })}{\max } ~\ell _\mathrm{CPG}(\varvec{\beta }, \varvec{\alpha }), \end{aligned}$$
(3.36)

this explicitly uses that we have the same data representation \((N_i,Y_i)_i\) in both log-likelihoods.

Remarks 3.10

  • Under the assumptions of Theorem 3.8, and additionally assuming that \({\varvec{x}}_i={\varvec{z}}_i={\varvec{x}}^*_i={\varvec{z}}^*_i\), we receive equality in (3.36). Since the mean estimates in the CPG case do not depend on the particular choice of the shape parameter \(\gamma\), the same must hold true for Tweedie’s CP DGLM under identical covariates \({\varvec{x}}_i={\varvec{z}}_i={\varvec{x}}^*_i={\varvec{z}}^*_i\). Using Proposition 2.1 we then receive the dispersion scaling of Theorem 3.6; thus, this gives us a second, shorter proof of Theorem 3.6.

  • If \({\varvec{x}}_i^*={\varvec{z}}_i^*={\varvec{x}}_i\cup {\varvec{z}}_i\) and \({\varvec{x}}_i \ne {\varvec{z}}_i\), the CPG model is strictly nested in Tweedie’s CP model and, in general, we do not get equality in (3.36). In that case, Theorem 3.8 reflects an ideal world because noise in the data prevents the MLEs (computed separately in both models) from strictly satisfying the identities in Theorem 3.8.

  • To perform model selection in the general case we can use Akaike’s information criterion (AIC) [1]. This corrects both sides of (3.36) for the number of regression parameters involved; thus, with AIC, we prefer the model with the smaller of the two values (a small numerical sketch follows these remarks)

    $$\begin{aligned}&-2\underset{(\varvec{\beta }^*, \varvec{\alpha }^*)}{\max }~ \ell _\mathrm{Tw}(\varvec{\beta }^*, \varvec{\alpha }^*) +2(d^*+ q^*+ 2) \quad \text { or} \nonumber \\&\quad -2\underset{(\varvec{\beta }, \varvec{\alpha })}{\max } ~\ell _\mathrm{CPG}(\varvec{\beta }, \varvec{\alpha }) +2(d+q+2). \end{aligned}$$
    (3.37)

    AIC applies because in both models we use the same data representation \((N_i,Y_i)_i\) and both models are evaluated at the MLEs of the corresponding parameters. We emphasize that for estimating the CPG GLM we use in (3.36) the sufficient statistics \(Y_i=N_i{\bar{Z}}_i/w_i\). If, instead, we use the individual claim sizes \(Z_{i,j}\) to estimate the CPG GLM, AIC does not apply because the log-likelihoods to be compared use the available data in different ways.
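For concreteness, the comparison (3.37), together with the likelihood-ratio test used later in Sect. 4.2.1, can be sketched in R as follows; ll_tw and ll_cpg denote the maximal log-likelihoods (3.34)–(3.35) evaluated at the MLEs on the same data representation \((N_i,Y_i)_i\), and d, q, d_star, q_star are the covariate dimensions (all variable names are ours):

# AIC comparison (3.37); the model with the smaller value is preferred.
aic_tw  <- -2 * ll_tw  + 2 * (d_star + q_star + 2)
aic_cpg <- -2 * ll_cpg + 2 * (d + q + 2)
# Likelihood-ratio test for the CPG model nested in Tweedie's CP model;
# the test statistic is non-negative by (3.36).
lrt_stat <- 2 * (ll_tw - ll_cpg)
df       <- (d_star + q_star) - (d + q)    # difference in parameter counts
p_value  <- pchisq(lrt_stat, df = df, lower.tail = FALSE)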

4 Numerical examples

We study two numerical examples to benchmark the two modeling approaches of Theorem 3.8. First, we design a synthetic data example that fully meets the assumptions of Theorem 3.8. Thus, there is no model uncertainty involved in this first (synthetic) example about the underlying distributions, covariate spaces and link functions, and we can fully focus on estimating parameters with MLE in the CPG GLM and in Tweedie’s CP DGLM. These results are then compared to neural network regression approaches on the same synthetic data; in contrast to GLMs, neural networks explore optimal covariate selection themselves. This is done in Sect. 4.2.2. Our second example in Sect. 4.3 is a real data example. This additionally raises the issue of model uncertainty because the real data has not been generated by a CPG model. Both examples are based on the motorcycle insurance data swmotorcycle used in [11]; this data is available through the R package CASdatasets [3], see Listing 1 for an excerpt of the data. For the synthetic data we sample a portfolio of covariates from the original data, and then generate claims with a CPG GLM designed according to the assumptions of Theorem 3.8. For the real data example we fully rely on the swmotorcycle data and we use the corresponding claim observations.

4.1 Description of motorcycle data

We briefly describe the data; for more information we refer to “Appendix B” below. The data comprises comprehensive insurance for motorcycles, which covers loss or damage to motorcycles other than collision damage, for instance caused by theft, fire or vandalism. The data is aggregated at insurance policy level for the years 1994–1998. The data is shown in Listing 1. We have applied some pre-processing, e.g., we have dropped all policies that have an exposure equal to zero.

Listing 1 An excerpt of the swmotorcycle data

We briefly describe the variables; the following enumeration refers to lines 2–10 of Listing 1:

  1. Age: age of motorcycle owner in \(\{18,\ldots , 70\}\) years (we cap at 70 because of scarcity above);

  2. Gender: gender of motorcycle owner either being Female or Male;

  3. Zone: seven geographical Swedish zones being (1) central parts of Sweden’s three largest cities, (2) suburbs and middle-sized towns, (3) lesser towns except those in zones (5)–(7), (4) small towns and countryside except those in zones (5)–(7), (5) Northern towns, (6) Northern countryside, and (7) Gotland (Sweden’s largest island);

  4. McClass: seven ordered motorcycle classes received from the so-called EV ratio defined as (Engine power in kW \(\times\) 100)/(Vehicle weight in kg \(+\) 75 kg);

  5. McAge: age of motorcycle in \(\{0,\ldots , 30\}\) years (we cap at 30 because of scarcity beyond);

  6. Bonus: ordered bonus-malus class from 1 to 7, entry level is 1;

  7. Exposure: total exposure in yearly units in the interval [0.0274, 31.3397], the shortest entry referring to 1 day and the longest one to more than 31 years;

  8. ClaimNb: number of claims N on the policy;

  9. ClaimCosts: total claim costs \(S=\sum _{j=1}^N Z_j\) on the policy (thus, we do not have information about individual claims \(Z_j\) but only about the sufficient statistic \({\bar{Z}}\) on each policy).

The data is illustrated in “Appendix B”.

4.2 Synthetic data example

This section is based on synthetic (simulated) data from a CPG GLM.

4.2.1 A generalized linear model approach

We start by describing the simulation of the synthetic data. We randomly choose \(n=250'000\) insurance policies from dat=swmotorcycle using the R code:

portfolio <- dat[sample(x = c(1:nrow(dat)), size = 250000, replace = TRUE), ]

Based on this portfolio we generate claims \((N,Y)\) using two GLMs that fulfill the CPG assumptions of Theorem 3.8; the modeling details are specified in columns 1–3 of Table 1. We especially emphasize that the covariate spaces \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) differ for claim counts and claim sizes.

Table 1 Synthetic CPG GLM example: the first 3 columns show the chosen (true) model; column ‘estimated CPG’ shows the resulting MLEs (with estimated std.dev. in brackets)

CPG GLM We estimate the Poisson claim counts GLM and the gamma claim amounts GLM separately, according to Sect. 3.1 and under log-link choices. The results are presented in column ‘estimated CPG’ of Table 1, the brackets provide one estimated standard deviation received from the inverse of Fisher’s information matrix. Note that we can estimate all regression parameters \(\beta _k\) and \(\alpha _k\) without specifying shape parameter \(\gamma >0\) explicitly. Most estimated parameters are within one standard deviation of the true parameter values. The true parameters have been chosen such that they resemble the true data swmotorcycle. The true data has an observed claim frequency of only 1.05%, see “Appendix B”. In the present example, claims are scarce too, and the gamma claim size GLM has been estimated on (only) 2’795 claims. The parameter estimates are remarkably accurate (we do not have model uncertainty here, only parameter estimation uncertainty). We conclude that this model can be calibrated well using the separate approach for claim counts and claim amounts.

Fig. 1 (lhs) Log-likelihood \(\gamma \mapsto \ell (\widehat{\varvec{\alpha }}, \gamma )\) of the gamma GLM to estimate shape parameter \(\gamma\) for given \(\widehat{\varvec{\alpha }}\); (rhs) Tweedie profile log-likelihood \(p\mapsto \ell (\widehat{\varvec{\beta }}^*, \widehat{\varvec{\alpha }}^*(p),p)\) to estimate p

Figure 1 (lhs) considers the log-likelihood function \(\gamma \mapsto \ell (\widehat{\varvec{\alpha }}, \gamma )\) of the gamma GLM to estimate shape parameter \(\gamma\), we also refer to score Eq. (3.9). From this we find MLE \({\widehat{\gamma }}=1.56\), and the inverse of Fisher’s information matrix provides an estimated standard deviation of 0.04 for this estimate. Thus, the estimated shape parameter is slightly too high, though still within two standard deviations of the true value of \(\gamma =1.5\). We again highlight that this estimate is based on only 2’795 claims. Moreover, we remark that \({\widehat{\gamma }}\) has been used in the standard deviation estimates of Table 1, see (3.8).
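A minimal sketch of this estimation step in R reads as follows; it uses that, given the claim count \(N_i>0\), the average claim size \({\bar{Z}}_i\) is gamma distributed with shape \(N_i\gamma\) and mean \(\zeta _i\), with \(\zeta _i\) replaced by its fitted value (the variable names and the search interval are ours):

# Profile log-likelihood gamma -> l(alpha_hat, gamma), cf. Fig. 1 (lhs);
# N, Zbar, zeta_hat are restricted to policies with N > 0.
loglik_gamma <- function(g, N, Zbar, zeta_hat) {
  a <- N * g    # shape of Zbar ~ Gamma(N*g, rate = N*g/zeta)
  sum(a * log(a / zeta_hat) + (a - 1) * log(Zbar) - a * Zbar / zeta_hat - lgamma(a))
}
gamma_hat <- optimize(loglik_gamma, interval = c(0.1, 10), maximum = TRUE,
                      N = N, Zbar = Zbar, zeta_hat = zeta_hat)$maximum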

Table 2 Synthetic example: the first 3 columns show the chosen (true) model; column ‘estimated CPG’ (in italic) shows estimated parameters from the CPG model; column ‘estimated Tweedie’s CP’ shows the MLEs from Tweedie’s CP model (with estimated std.dev. in brackets)

Tweedie’s DGLM Next we turn our attention to Tweedie’s CP case. The true values \(\varvec{\beta }\), \(\varvec{\alpha }\) and \(\gamma\) as well as their MLE counterparts \(\widehat{\varvec{\beta }}\), \(\widehat{\varvec{\alpha }}\) and \({\widehat{\gamma }}\) from the CPG model are transformed with Theorem 3.8 to receive the same model in Tweedie’s CP parametrization; this is illustrated in the first four columns of Table 2. In a first calibration step for Tweedie’s CP model, we choose \(p=1.39\), which is the optimal power variance parameter estimate of the CPG model, see the last line in column 4 of Table 2. We then calibrate Tweedie’s CP DGLM for this power variance parameter p with Fisher’s scoring method (3.25)–(3.26); as starting values for the algorithm we use the estimates from the CPG model (in italic in Table 2). Fisher’s scoring method converges in 7 iterations with these initial values. Due to (3.36) we receive a model that has a larger log-likelihood than its CPG counterpart (we include all constants in this consideration so that the log-likelihoods are directly comparable).

In the next step, we optimize over the power variance parameter p. For this, we use Theorem 3.6, which says that the mean estimates \({\widehat{\mu }}_i\) do not depend on p and which provides the p-scaling of the dispersion parameter MLEs \({\widehat{\phi }}_i(p)\). This allows us to directly plot the profile log-likelihood \(p\mapsto \ell (\widehat{\varvec{\beta }}^*, \widehat{\varvec{\alpha }}^*(p),p)\) as a function of \(p \in (1.36, 1.41)\), see Fig. 1 (rhs). From this figure, we find the maximizing value \({\widehat{p}}=1.39\), which is close to the true value of \(p=1.4\). The second-to-last column in Table 2 shows the resulting MLEs \(\widehat{\varvec{\beta }}^*\) and \(\widehat{\varvec{\alpha }}^*({\widehat{p}})\) of the optimal Tweedie’s CP model. A first observation is that the parameter estimates from Tweedie’s CP model are not as close to the true values as the MLEs from the CPG model. However, model selection should not be based on this observation: note that the (true) CPG model has 22 parameters and Tweedie’s CP model has 33 parameters; therefore, we expect some differences in model calibration.

Table 3 Synthetic example: summary statistics of fitted CPG and Tweedie’s CP GLMs

We summarize the two estimated models in Table 3. On row (a) we compare the log-likelihoods \(\ell _\mathrm{CPG}(\widehat{\varvec{\beta }},\widehat{\varvec{\alpha }}, {\widehat{p}})\) and \(\ell _\mathrm{Tw}(\widehat{\varvec{\beta }}^*,\widehat{\varvec{\alpha }}^*, {\widehat{p}})\) of the estimated CPG and Tweedie’s CP models, see also (3.36), to the one of the true model \(\ell ({\varvec{\beta }^*},{\varvec{\alpha }^*}, {p})\): we observe that both models slightly overfit to the data, with Tweedie’s CP model having a slightly larger overfit [this is consistent with (3.36)]. Therefore, we penalize the log-likelihoods of the models in AIC by the number of parameters involved, see (3.37). The AIC values are given on row (b) of Table 3, and we give preference to the CPG calibration. Performing a likelihood-ratio test with the CPG model as the null hypothesis model nested in Tweedie’s CP model gives a p-value of 34%; thus, we do not reject the null hypothesis at the 5% significance level. This supports going for the smaller CPG model in this example. Row (c) of Table 3 gives the root mean square error (RMSE) between the true model means \(w_i\mu _i\) and their estimated counterparts \(w_i{\widehat{\mu }}_i=w_i{\widehat{\lambda }}_i{\widehat{\zeta }}_i\); rows (d)–(g) show average means and dispersions as well as the corresponding standard deviations. We observe that these figures match the true values quite well. Recall that these figures are based on one simulation from the true model for each insurance policy; thus, they involve simulation error (but they do not involve model error because we only assume the parameters \(\varvec{\beta }\), \(\varvec{\alpha }\) and p as unknown in this example). Moreover, we remark that the dispersion is not under-estimated; here, we also refer to the last bullet point of Remark 3.5.

Fig. 2 (lhs) Comparison of estimated means \({\widehat{\mu }}_i\) versus true means \(\mu _i\): CPG GLM (orange) and Tweedie CP DGLM (green); (rhs) CPG GLM means versus Tweedie CP DGLM means over all \(i=1,\ldots , n\) policies

Finally, in Fig. 2 we plot the predicted means \({\widehat{\mu }}_i\) against the true values \(\mu _i\). The left-hand side compares the two estimated models against the true model, and the right-hand side compares the two estimated models against each other. From these plots we conclude that both models are very accurate, the CPG estimated one (orange) being slightly closer to the true model than its Tweedie’s CP counterpart (green). Summarizing: this synthetic example gives evidence supporting the industry practice of focusing on the CPG model. Specifying covariate spaces is easier in the CPG case because the systematic effects of claim counts and claim amounts are clearly separated, and accuracy is slightly higher because Tweedie’s CP model seems to slightly overfit in our example.

4.2.2 A neural network regression approach

Next we explore neural network regression models on the same synthetic data. Neural networks have the capability of representation learning, which means that they can perform covariate engineering themselves; we refer to Sections 4 and 5 of [21]. Therefore, covariates can be provided to neural networks in their raw form. The neural networks then, at the same time, pre-process these covariates and predict the response variables. Starting from a GLM, the required changes to achieve this representation learning are comparatively small. We illustrate this in the present section. Alternatively, one may also be interested in using generalized additive models (GAMs). GAMs are more flexible than GLMs in modeling different functional forms in the components of the covariates; however, they do not automatically allow for flexible interaction modeling between covariate components. For this reason, we favor neural networks over GAMs.

We first define the (raw) covariate space \({{\mathcal {X}}}^\dagger\) which is going to be used throughout this section:

$$\begin{aligned} {\varvec{x}}^\dagger = (\mathtt{Age}, \mathtt{Gender}, \mathtt{Zone}, \mathtt{McClass}, \mathtt{McAge}) ~\in ~ {{\mathcal {X}}}^\dagger ~\subset ~{{\mathbb {R}}}^8, \end{aligned}$$
(4.1)

where we use dummy coding for the categorical variable \(\mathtt{Zone}\in \{0,1\}^4\). In contrast to Table 1, we do not specify the continuous variables in all their functional forms, but we let the neural network find these functional forms. A neural network is a function

$$\begin{aligned} \psi :{{\mathcal {X}}}^\dagger \rightarrow {{\mathbb {R}}}^d, \quad {\varvec{x}}^\dagger \mapsto {\varvec{x}}=\psi ({\varvec{x}}^\dagger ), \end{aligned}$$
(4.2)

that consists of a composition of a fixed number of hidden network layers, each of them having a certain number of hidden neurons. For an explicit mathematical definition we refer to Section 3.1 in [21]. \({\varvec{x}}^\dagger\) has the interpretation of being the raw covariate, and \({\varvec{x}}=\psi ({\varvec{x}}^\dagger ) \in {{\mathbb {R}}}^d\) can be interpreted as the (network) pre-processed covariate. These pre-processed covariates \(\psi ({\varvec{x}}^\dagger )\) are then used in a classical GLM, e.g., for claim counts we may set for the log-link choice, see (3.2),

$$\begin{aligned} \lambda : {{\mathcal {X}}}^\dagger \rightarrow {{\mathbb {R}}}_+,\qquad {\varvec{x}}^\dagger \mapsto \lambda ({\varvec{x}}^\dagger ) = \exp \left\langle \varvec{\beta },\psi ({\varvec{x}}^\dagger ) \right\rangle = \exp \left\{ \beta _0 + \sum _{k=1}^d\beta _k \psi _k({\varvec{x}}^\dagger ) \right\} , \end{aligned}$$
(4.3)

note that we use a slight abuse of notation here because, strictly speaking, \(\psi ({\varvec{x}}^\dagger )\) does not include an intercept term for \(\beta _0\), so this always needs to be added. The neural network regression function (4.3) involves regression parameters \(\varvec{\beta } \in {{\mathbb {R}}}^{d+1}\) as well as network weights \(\vartheta \in {{\mathbb {R}}}^r\) which parametrize the network function \(\psi =\psi _\vartheta\). The dimension r of \(\vartheta\) depends on the complexity of the chosen network \(\psi\). Network fitting now simultaneously trains the network parameter \(\vartheta\) for an optimal covariate pre-processing and the GLM parameter \(\varvec{\beta }\) for response prediction. State-of-the-art fitting uses variants of the gradient descent algorithm, and good performance depends on the complexity of \(\psi\); we just mention the universal approximation property of appropriately designed neural networks. For more information, we refer to the relevant literature, in particular to [21]. Based on this reference we explore (4.3) and its counterparts for claim sizes and Tweedie’s CP model. In all three prediction problems we use the identical covariate space \({{\mathcal {X}}}^\dagger\), and only the network function \(\psi\) will differ in the weights \(\vartheta\) to bring the covariates into the appropriate form for the corresponding prediction task.

Poisson claim counts We start by modeling claim counts using the neural network approach (4.3). We use the R library keras to implement this, and we use exactly the same architecture as in Listing 4 of [14]; the only thing that changes is the dimension of \({{\mathcal {X}}}^\dagger\) from 40 on line 1 of Listing 4 in [14] to 8 in the present example, see (4.1). This results in \(r=655\) network parameters and \(d+1=11\) GLM parameters. We fit these parameters in the usual way by considering 80% of the data for training and 20% of the data for out-of-sample validation to track overfitting in the gradient descent algorithm (we run 100 epochs with batch size 5000). We then choose the parameter that has the best out-of-sample performance on the validation data. To this network solution we apply the bias regularization step of Listing 5 in [21] to make the model unbiased. A minimal sketch of such an implementation is given below.
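The following R sketch indicates how such a Poisson network with an exposure offset can be set up in keras; it is not a reproduction of Listing 4 of [14], and the layer sizes and activations are illustrative assumptions (the bias regularization step is not included):

library(keras)

# Raw covariates x^dagger of dimension 8, see (4.1), and log-exposure offset.
covariates <- layer_input(shape = c(8), name = "covariates")
log_expo   <- layer_input(shape = c(1), name = "log_exposure")

# Network pre-processing psi(x^dagger), see (4.2); sizes are illustrative.
psi <- covariates %>%
  layer_dense(units = 20, activation = "tanh") %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = 10, activation = "tanh")

# GLM read-out beta_0 + <beta, psi(x^dagger)>, offset log(w_i), log-link (4.3).
response <- list(psi %>% layer_dense(units = 1), log_expo) %>%
  layer_add() %>%
  layer_activation("exponential")

model <- keras_model(inputs = list(covariates, log_expo), outputs = response)
model %>% compile(optimizer = "nadam", loss = "poisson")
# model %>% fit(list(X, log(w)), N, epochs = 100, batch_size = 5000,
#               validation_split = 0.2)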

Table 4 Comparison of deviance losses and RMSEs of the true (synthetic) model, the intercept model not using covariate information, the GLM approaches and the neural network approaches

On rows (a1)–(a2) of Table 4 we present the results for the claim counts neural network model. We provide the Poisson deviance losses of the true model \(\lambda _i\) (which is known here because we simulate from this model), of the intercept model that does not use covariate information (i.e. is only based on the intercept parameter \(\beta _0\)), of the claim counts GLM (upper part of Table 1) and of its neural network counterpart. We observe that both regression models slightly overfit to the data, with deviance losses of \(8.4366\cdot 10^{-2}\) and \(8.4393\cdot 10^{-2}\), respectively, compared to the true model loss of \(8.4431\cdot 10^{-2}\).

Fig. 3 Comparison of estimated models versus the true model: (lhs) Poisson claim counts models for \(N_i\); (rhs) gamma claim size models for \({\bar{Z}}_i\)

On row (a2) we provide the RMSE between the true model means \(\lambda _i\) and the estimated ones \({\widehat{\lambda }}_i\). We note that the Poisson GLM has a smaller RMSE than the neural network Poisson regression model. This is not surprising because the Poisson GLM uses the right functional form (no model uncertainty) and only estimates the regression parameter \(\varvec{\beta }\), whereas the neural network regression model also has to determine this functional form from the raw covariates \({\varvec{x}}^\dagger\). In Fig. 3 (lhs) we compare the resulting estimated frequencies to the true ones on all individual insurance policies \(i=1,\ldots , n\). From this plot we conclude that both models do a fairly good job because the dots lie more or less on the diagonal (which reflects the perfect model).

Gamma claim sizes Next we consider a neural network approach for the gamma claim sizes. This essentially means that we replace linear predictor (3.5) by the following neural network predictor (under a log-link choice for \(g_2\))

$$\begin{aligned} \zeta : {{\mathcal {X}}}^\dagger \rightarrow {{\mathbb {R}}}_+,\quad {\varvec{x}}^\dagger \mapsto \zeta ({\varvec{x}}^\dagger ) = \exp \left\langle \varvec{\alpha },\psi ({\varvec{x}}^\dagger ) \right\rangle = \exp \left\{ \alpha _0 + \sum _{k=1}^d\alpha _k \psi _k({\varvec{x}}^\dagger ) \right\} , \end{aligned}$$
(4.4)

where \(\psi\) is a neural network function (4.2) that may have the same structure as the one used for the Poisson regression model (4.3), but typically differs in the network weights \(\vartheta\). For simplicity, we use exactly the same neural network architecture as in the Poisson case; only the exposure offset is dropped and the Poisson deviance loss function is changed to the gamma deviance loss function (including weights), in line with the distributional assumptions made.
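A minimal sketch of such a weighted gamma deviance loss in keras could read as follows; this is our reconstruction, not the authors' listing. The weights \(N_i\) are passed through the sample_weight argument of fit, so the loss itself is per observation:

library(keras)

# Per-observation gamma deviance 2((y - mu)/mu - log(y/mu)); the weights N_i
# are supplied via sample_weight in fit(), so they do not appear here.
gamma_deviance <- function(y_true, y_pred) {
  2 * ((y_true - y_pred) / y_pred - k_log(y_true / y_pred))
}
# model %>% compile(optimizer = "nadam", loss = gamma_deviance)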

The results are presented on rows (b1)–(b2) of Table 4 and in Fig. 3 (rhs) (we run 1000 epochs with batch size 5000 and we call back the model with the smallest validation loss). Again we receive reasonably good results from the network approach, i.e., covariate engineering on \({{\mathcal {X}}}^\dagger\) is done quite well by the network; we emphasize that these results are based on only 2’795 claims. But we also see from Fig. 3 (rhs) that the individual predictions spread more around the diagonal than in the gamma GLM case (where we assume perfect knowledge about the functional form of the regression function). Better accuracy can only be achieved by having more claim observations.

Next, we estimate the shape parameter \(\gamma\). This is done analogously to the gamma GLM case by plotting the corresponding log-likelihood \(\ell (\gamma )\) as a function of \(\gamma\). This gives the estimate \({\widehat{\gamma }}=1.57\), which is slightly too large but still reasonable compared to the true value of \(\gamma =1.5\). A shape parameter that is too high implies a dispersion that is too low, which is a sign of over-fitting to the observations.

Table 5 Synthetic example: summary statistics of the fitted CPG and Tweedie’s CP neural network models

We conclude with the summary statistics for the neural network approaches in Table 5, column ‘estimated CPG’, which look fairly similar to the GLM ones in Table 3. We obtain a larger RMSE, which is not surprising because we have more model uncertainty due to missing covariate knowledge; this is also obvious from Fig. 3.

Tweedie’s compound Poisson neural network approach First, we remark that, in general, there is no simple comparison between a CPG and a Tweedie CP neural network approach similar to (3.34)–(3.35). The relation (3.34)–(3.35) strongly relies on the fact that we can directly compare linear predictors under suitable choices of covariate spaces. Since the networks in (4.2) transform covariates in a non-trivial way through non-linear activation functions, there is no hope of an easy comparison between the models unless the network architectures are chosen in a very specific (and rather artificial) way. Therefore, we do not aim to nest the CPG neural network into Tweedie’s CP neural network model, but we directly focus on modeling the latter. This essentially implies that we have to replace the linear predictors (3.21)–(3.22) by the following two-dimensional neural network predictors (under log-link choices for \(g_p\) and \(g_d\))

$$\begin{aligned} (\mu ,\phi ) : {{\mathcal {X}}}^\dagger \rightarrow {{\mathbb {R}}}^2_+,\quad {\varvec{x}}^\dagger \mapsto (\mu ,\phi )({\varvec{x}}^\dagger ) = \left( \exp \left\langle \varvec{\beta }^*,\psi ({\varvec{x}}^\dagger ) \right\rangle , \exp \left\langle \varvec{\alpha }^*,\psi ({\varvec{x}}^\dagger ) \right\rangle \right) , \end{aligned}$$

where \(\psi\) is a neural network function (4.2). The first component of \((\mu ,\phi )({\varvec{x}}^\dagger ) \in {{\mathbb {R}}}^2_+\) predicts the total claim costs Y and the second component estimates the dispersion parameter \(\phi\). We use one network \(\psi\) to simultaneously perform this prediction task for mean and dispersion parameter. We implement this in the R library keras and we use the same architecture as in Listing 4 of [14], but we need to change the input dimension to 8 and the output dimension to 2. The exposures \(w_i\) are treated as weights as follows

$$\begin{aligned} \ell (\mu ,\phi ) ~\propto ~ \sum _{i=1}^n w_i\left[ \frac{1}{\phi _i}\left( Y_i\frac{\mu _i^{1-p}}{1-p} - \frac{\mu _i^{2-p}}{2-p}\right) -\frac{N_i/w_i}{p-1}\log \phi _i \right] . \end{aligned}$$

This requires a custom-made loss function in keras for parameter estimation; the details are provided in Listing 2 in the “Appendix”. We fit this model with the gradient descent algorithm using exactly the same methodology as outlined above (callback of the lowest validation loss model after 100 epochs with batch size 5000).
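A schematic version of such a custom loss could look as follows; it is an illustrative reconstruction, not the paper's Listing 2. We pack \(y_{\mathrm{true}}=(Y_i, N_i/w_i)\) and \(y_{\mathrm{pred}}=(\mu _i,\phi _i)\) as two-column tensors and pass the exposures \(w_i\) via sample_weight in fit:

library(keras)

p <- 1.39    # power variance parameter, held fixed during each fit
tweedie_dglm_loss <- function(y_true, y_pred) {
  Y   <- y_true[, 1]    # aggregated claim costs Y_i
  NW  <- y_true[, 2]    # claim counts scaled by exposure, N_i / w_i
  mu  <- y_pred[, 1]    # mean estimate (first network output)
  phi <- y_pred[, 2]    # dispersion estimate (second network output)
  # negative per-observation log-likelihood contribution, see the display above
  -((Y * mu^(1 - p) / (1 - p) - mu^(2 - p) / (2 - p)) / phi
    - NW / (p - 1) * k_log(phi))
}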

In order to come up with the optimal neural network model we need to fit neural networks for multiple power variance parameters p, because there is no result similar to Theorem 3.6 that allows for a shortcut; a schematic grid search over p is sketched below. Of course, this disadvantages Tweedie’s CP neural network model from a computational point of view. We come up with an optimal power variance parameter estimate of \({\widehat{p}}=1.390\), which then yields the results in the last column of Table 5. From the figures on rows (c)–(g) we conclude that Tweedie’s CP approach is not fully competitive with the CPG fitting. These differences are also illustrated in Fig. 4, with the CPG approach being slightly closer to the true model means. Nevertheless, all these estimates look very reasonable, and the estimated neural network seems to capture the crucial features of the true model.
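The grid search mentioned above can be sketched as follows; fit_tweedie_network is a hypothetical wrapper (our name) that compiles the network with tweedie_dglm_loss for the given p, fits it, and returns the achieved validation loss:

# One full network fit per candidate p; grid and step size are illustrative.
p_grid <- seq(1.30, 1.50, by = 0.02)
val_losses <- sapply(p_grid, function(p) fit_tweedie_network(p))
p_hat <- p_grid[which.min(val_losses)]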

Fig. 4 (lhs) Comparison of estimated means \({\widehat{\mu }}_i\) versus true means \(\mu _i\): CPG neural network (orange) and Tweedie CP neural network (green); (rhs) CPG network means versus Tweedie CP neural network means over all \(i=1,\ldots , n\) policies

Conclusions from our synthetic data example Our findings support the industry practice of focusing on the CPG parametrization. Our estimated models based on this parametrization are closer to the true model than the ones obtained from Tweedie’s CP parametrization. If we work under GLM assumptions, we need to pre-process covariates, which is easier in the CPG parametrization because the systematic effects of claim counts and claim amounts can be separated. If we work with neural network regression models, model calibration is not efficient under Tweedie’s CP parametrization because we need to run gradient descent algorithms for multiple power variance parameters p to find the optimal model. Moreover, in our example, the CPG case leads to more accurate predictive models.

4.3 Real data example: an outlook

In view of the previous example everything seems to be fairly clear. However, our synthetic data is based on the very strong property of having gamma claim sizes with constant shape parameter \(\gamma\) over the whole insurance portfolio. This assumption may be critical in real insurance applications. We briefly analyze it in terms of our real data example given in “Appendix B”, and we give an outlook in case this assumption is not fulfilled. We keep this section very short, and we mainly view it as a motivation to conduct future research in this direction.

There are two possibilities in which the constant shape parameter assumption may fail: either the claim sizes are gamma distributed but the shape parameter \(\gamma _i\) is insurance policy i dependent, or the gamma distribution is inappropriate because the claim sizes exhibit too heavy tails. We explore this on the real data example provided in “Appendix B”. For this it suffices to focus on the gamma claim size model, i.e. we do not study claim counts in this real data example. Moreover, to minimize covariate pre-processing we explore a gamma neural network regression model on these claim sizes; the chosen model architecture is identical to the one used in (4.4), in particular, it does covariate engineering itself.

Table 6 Comparison of gamma deviance losses on real data: intercept model and gamma neural network regression model
Fig. 5 (lhs) Log-likelihood \(\gamma \mapsto \ell (\gamma )\) of the gamma neural network to estimate shape parameter \(\gamma\) on the real data; (rhs) Tukey–Anscombe plot giving gamma deviance residuals against fitted means

Table 6 shows the (in-sample) gamma deviance losses of the intercept model and the neural network regression model. Obviously, the neural network approach has a better performance (note that the network model has been obtained from a proper training-validation analysis as described above). Using the resulting mean estimates \({\widehat{\zeta }}_i\) we can estimate the (constant) shape parameter \(\gamma\). This is illustrated in Fig. 5 (lhs): we estimate \({\widehat{\gamma }}=0.75\). Thus, we receive a shape parameter smaller than 1, which provides over-dispersion \(1/{\widehat{\gamma }}=1.33>1\); i.e., the estimated gamma densities are strictly decreasing. This fact requires further examination because there might be two situations: either the true shape parameter is smaller than 1 (and everything is fine), or the claim sizes are more heavy tailed than a gamma distribution allows. The latter is typically compensated by over-dispersion in the estimated model. We analyze this warning signal on our real data.

Fig. 6 (lhs) QQ-plot of the estimated gamma model for claim sizes; (rhs) density of the real data compared to one simulation from the estimated model

Figure 5 (rhs) gives the Tukey–Anscombe plot of the gamma deviance residuals against the fitted means. This plot supports the model choice because we cannot see any particular structure in the figure; it also supports the constant shape parameter assumption on \(\gamma\). Figure 6 gives the QQ-plot and compares the observed claims against one simulation from the fitted model. These two plots also look quite reasonable; one may only question the upper tail of the QQ-plot.
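For completeness, the gamma deviance residuals underlying the Tukey–Anscombe plot of Fig. 5 (rhs) can be computed as follows (a minimal sketch with our variable names):

# Gamma deviance residuals against fitted means; Zbar are the observed average
# claim sizes, zeta_hat the fitted means, N the claim counts acting as weights.
dev_res <- sign(Zbar - zeta_hat) *
  sqrt(2 * N * ((Zbar - zeta_hat) / zeta_hat - log(Zbar / zeta_hat)))
plot(zeta_hat, dev_res, xlab = "fitted means", ylab = "deviance residuals")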

Conclusions The short analysis of the real data has shown that for the motorcycle claims data the gamma claim size model is fairly reasonable, thus supporting the CPG model. On different data, one may have to relax the constant shape parameter assumption on \(\gamma\). This may result in a DGLM for gamma claim sizes (which is known in industry) and a Poisson GLM for claim counts. Again, this model can easily be fitted in the Poisson-gamma parametrization; however, this approach does not have a Tweedie CP counterpart relying on a fixed parameter p, giving more support to the industry preference for the Poisson-gamma parametrization.

5 Conclusion

We have revisited the compound Poisson model with i.i.d. gamma claim sizes. This model allows for two different parametrizations, namely, the Poisson-gamma parametrization and Tweedie’s compound Poisson parametrization. We have provided results for GLMs illustrating when the two parametrizations are identical, and we have established a theorem that allows for efficient fitting of the power variance parameter in Tweedie’s parametrization (under log-link choices for the GLMs).

In the applied section, we have analyzed why the insurance industry gives preference to the Poisson-gamma parametrization. Based on examples, we find that, indeed, this parametrization is easier to fit, and results turn out to be more accurate in our examples. In particular, under neural network regression models we give a clear preference to the Poisson-gamma parametrization because Tweedie’s version does not possess an easy and efficient way of estimating the power variance parameter. That is, the Tweedie version is computationally clearly lagging behind the Poisson-gamma case.

For our real data example it turns out that the gamma claim size model with constant shape parameter is quite reasonable. However, in many other applications this is not the case. Therefore, the insurance industry explores double GLMs for a flexible modeling of the shape parameters of claim sizes; on the other hand, a case-dependent modeling of p in Tweedie’s compound Poisson parametrization is not (easily) feasible. For modeling more heavy-tailed claim sizes, mixture models are a promising proposal.