Abstract
The most commonly used regression model in general insurance pricing is the compound Poisson model with gamma claim sizes. There are two different parametrizations for this model: the Poisson-gamma parametrization and Tweedie’s compound Poisson parametrization. The insurance industry typically prefers the Poisson-gamma parametrization. We review both parametrizations, provide new results that help to lower computational costs for Tweedie’s compound Poisson parameter estimation within generalized linear models, and provide evidence supporting the industry preference for the Poisson-gamma parametrization.
Introduction
The most commonly used regression model in general insurance pricing is the compound Poisson model with gamma claim sizes. State-of-the-art industry practice fits two separate generalized linear models (GLMs) to the two parts of this model, namely, a Poisson GLM to claim counts and a gamma GLM to claim amounts. Both the Poisson and the gamma distributions belong to the exponential dispersion family (EDF). It has been noted by Tweedie [19] that the compound Poisson model with i.i.d. gamma claim sizes itself belongs to the EDF and, in fact, it closes the interval of power variance functions between the Poisson model and the gamma model, see Section 3 in Jørgensen [5]. As a result of Tweedie’s and Jørgensen’s findings we obtain two different parametrizations of the compound Poisson model with i.i.d. gamma claim sizes. Selection between these two different parametrizations has been explored in the work of Jørgensen–de Souza [6] in the context of GLM insurance pricing. Interestingly, to predict total claim amounts we need to fit two GLMs in the compound Poisson-gamma parametrization, whereas one GLM is sufficient to get the corresponding predictions within Tweedie’s EDF parametrization. This indicates that in GLM applications these two parametrizations are not fully consistent. This point has been raised by Smyth–Jørgensen [17], who propose to use a double generalized linear model (DGLM) in Tweedie’s parametrization to simultaneously model mean and dispersion parameters within the EDF.
The main purpose of this article is to revisit the work of Smyth–Jørgensen [17], and to give properties under which the two GLMs for claim counts and claim sizes and the DGLM with Tweedie’s EDF parametrization lead to the same predictive model; this involves a discussion about choices of covariate spaces and GLM link functions. Based on this, our first main contribution provides a new result for Tweedie’s DGLM that substantially reduces computational costs in calibrations of power variance parameters.
The second point that we explore is whether the insurance industry’s preference for the Poisson-gamma parametrization can be justified. A priori it is not clear whether either of the two approaches leads to better predictive models. This part of our work is based on GLMs and on their neural network extensions. We obtain evidence that supports the industry preference; in particular, under neural network regression models the Poisson-gamma parametrization is simpler to calibrate and leads to more robust results.
We close this introduction with a number of remarks. First, we mention the recent survey paper of Quijano Xacur–Garrido [12], which has similar goals to the present paper. This survey only considers the single GLM case of Tweedie’s parametrization, similar to Jørgensen–de Souza [6]. We emphasize that the full picture can only be obtained by comparing the Poisson-gamma parametrization to the DGLM case introduced in Smyth–Jørgensen [17]. Therefore, we revisit and extend this latter reference to obtain a comprehensive comparison. Our view is supported by examples. These examples provide a proof of concept for situations with claims that are not too heavy tailed. However, these examples also highlight the weaknesses of this model on real insurance data, which often exhibits heavier tails than what is suitable under a gamma assumption. We remark that in our discussion we use the terminology of general insurance pricing; however, as is commonly the case in general insurance, all our findings can be translated one-to-one to claims reserving problems.
Organization of the paper. In Sect. 2 we introduce the compound Poisson model with i.i.d. gamma claim sizes and we derive its corresponding Tweedie parametrization. In Sect. 3 we embed both approaches into a GLM framework. We present the two GLMs needed for the Poisson-gamma parametrization, and we discuss a single GLM and a DGLM parametrization for Tweedie’s approach. Our main results, Theorems 3.6 and 3.8, give conditions under which the different GLM parametrizations lead to identical predictive models. These theorems provide a remarkable property that allows us to lower calibration costs in Tweedie’s DGLMs. In Sect. 4 we give insights and intuition based on numerical examples both under GLMs and neural network regression models. In Sect. 5 we conclude, and the “Appendix” gives a short summary of GLMs and describes the data used.
Tweedie’s compound Poisson model
In Sect. 2.1 we introduce the compound Poisson model with i.i.d. gamma claim sizes, and in Sect. 2.2 we revisit its Tweedie counterpart. For simplicity, in these two sections, we think of using these models for modeling one single insurance policy only. In Sect. 3, below, we consider multiple insurance policies also allowing for heterogeneity between policies.
Compound Poisson model with i.i.d. gamma claim sizes
Let N be the number of claims and let \((Z_j)_{j\ge 1}\) be the corresponding claim sizes. We assume that the number of claims, N, is Poisson distributed with mean \(\lambda w\), where \(\lambda >0\) is the expected claim frequency relative to a given exposure \(w>0\); we write \(N\sim \mathrm{Poi}(\lambda w)\). We assume that the claim sizes \(Z_j\), \(j\ge 1\), are i.i.d. and independent of N having a gamma distribution with shape parameter \(\gamma >0\) and scale parameter \(c>0\); we write \(Z_1 \sim {{\mathcal {G}}}(\gamma ,c)\) for this gamma distribution. The moment generating function of the gamma claim sizes is given by, see Section 3.2.1 in [20],
$$\begin{aligned} M_{Z_1}(r) = {{\mathbb {E}}}\left[ e^{rZ_1}\right] = \left( \frac{c}{c-r}\right) ^{\gamma } \qquad \text {for } r<c. \end{aligned}$$
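The compound Poisson model with i.i.d. gamma claim sizes (CPG) is then defined by \(S=\sum _{j=1}^N Z_j\); we use notation \(S\sim \mathrm{CPG}(\lambda w, \gamma , c)\). The moment generating function of S is given by
$$\begin{aligned} M_{S}(r) = \exp \left\{ \lambda w \left[ \left( \frac{c}{c-r}\right) ^{\gamma } -1 \right] \right\} \qquad \text {for } r<c, \end{aligned}$$
we refer to Proposition 2.11 in [20].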
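The definition \(S=\sum _{j=1}^N Z_j\) can be illustrated by a small simulation. The following is a minimal sketch using only the Python standard library; the function names `sample_poisson` and `sample_cpg` and all parameter values are hypothetical choices for illustration. It assumes, as implied by \({{\mathbb {E}}}[Z_1]=\gamma /c\), that c acts as a rate, so the scale passed to the sampler is 1/c.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw a Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def sample_cpg(lam, w, gamma, c, rng):
    """Draw (N, S) with N ~ Poi(lam * w) and S the sum of N gamma claim sizes."""
    n = sample_poisson(lam * w, rng)
    # random.gammavariate takes (shape, scale); E[Z] = gamma / c means scale = 1 / c.
    claims = [rng.gammavariate(gamma, 1.0 / c) for _ in range(n)]
    return n, sum(claims)

rng = random.Random(2021)
lam, w, gamma, c = 0.5, 4.0, 2.0, 1.0   # illustrative values only
draws = [sample_cpg(lam, w, gamma, c, rng) for _ in range(20000)]

mean_n = sum(n for n, _ in draws) / len(draws)            # close to lam * w
mean_s = sum(s for _, s in draws) / len(draws)            # close to lam * w * gamma / c
zero_share = sum(1 for n, _ in draws if n == 0) / len(draws)  # close to exp(-lam * w)
```

The point mass of S at zero, visible in `zero_share`, is the feature that distinguishes the compound Poisson model from a pure gamma model.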
Tweedie’s compound Poisson model
Following [5, 6, 17, 19] we select a particular model within the EDF. A random variable Y belongs to the EDF if its density has the following form (w.r.t. a \(\sigma\)-finite measure on \({{\mathbb {R}}}\))
$$\begin{aligned} f_Y(y; \theta , w/\phi ) = \exp \left\{ \frac{y\theta - \kappa (\theta )}{\phi /w} + a(y; w/\phi )\right\} , \end{aligned}$$
where \(w>0\) is a given exposure (weight, volume), \(\phi >0\) is the dispersion parameter, \(\theta \in \varvec{\Theta }\) is the canonical parameter in the effective domain \(\varvec{\Theta }\), \(\kappa :\varvec{\Theta }\rightarrow {{\mathbb {R}}}\) is the cumulant function, and \(a(\cdot ; \cdot )\) is the normalization, which does not depend on the canonical parameter \(\theta\).
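For properties of the EDF we refer to “Appendix A”, below. Tweedie’s compound Poisson (CP) model is obtained by choosing for \(p \in (1,2)\) the cumulant function
$$\begin{aligned} \kappa _p(\theta ) = \frac{1}{2-p}\left( (1-p)\theta \right) ^{\frac{2-p}{1-p}} \qquad \text {for } \theta \in \varvec{\Theta }={{\mathbb {R}}}_-. \end{aligned}$$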
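We use notation \(Y\sim \mathrm{Tweedie}(\theta , w, \phi , p)\). The first two derivatives of the cumulant function provide the first two moments of Y, see also (A.1) in the “Appendix”,
$$\begin{aligned} \mu = {{\mathbb {E}}}[Y] = \kappa _p'(\theta ) = \left( (1-p)\theta \right) ^{\frac{1}{1-p}} \qquad \text {and}\qquad \mathrm{Var}(Y) = \frac{\phi }{w}\,\kappa _p''(\theta ) = \frac{\phi }{w}\,\mu ^p. \end{aligned}$$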
Hyperparameter \(p\in (1,2)\) allows us to model the power variance functions \(V(\mu )=\mu ^p\) between the Poisson boundary case \(p=1\) and the gamma boundary case \(p=2\); we refer to Sect. 3.1, below, for the boundary cases. The map \(\mu \mapsto \theta =(\kappa _p')^{-1}(\mu )\) gives the canonical link of Tweedie’s CP model.
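The relation between the cumulant function, the canonical link and the power variance function can be checked numerically. The sketch below assumes the standard form \(\kappa _p(\theta )=((1-p)\theta )^{(2-p)/(1-p)}/(2-p)\) for \(\theta <0\), which is consistent with the canonical link \(\theta =\mu ^{1-p}/(1-p)\) used in this section; the helper names and the values of p and \(\mu\) are illustrative assumptions.

```python
def canonical_link(mu, p):
    """theta = mu^(1-p) / (1-p); negative for p in (1, 2)."""
    return mu ** (1.0 - p) / (1.0 - p)

def cumulant(theta, p):
    """kappa_p(theta) = ((1-p) * theta)^((2-p)/(1-p)) / (2-p), defined for theta < 0."""
    return ((1.0 - p) * theta) ** ((2.0 - p) / (1.0 - p)) / (2.0 - p)

def first_derivative(f, x, h=1e-5):
    """Central finite difference for f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite difference for f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

p, mu = 1.5, 3.0                      # illustrative values only
theta = canonical_link(mu, p)
kappa = lambda t: cumulant(t, p)

kappa_prime = first_derivative(kappa, theta)     # should recover the mean mu
kappa_second = second_derivative(kappa, theta)   # should recover V(mu) = mu^p
```

Up to finite-difference error, `kappa_prime` agrees with \(\mu\) and `kappa_second` with \(\mu ^p\), which is the power variance function stated above.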
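We calculate the moment generating function of the exposure scaled Tweedie’s CP random variable wY, see also Corollary 7.21 in [20],
$$\begin{aligned} M_{wY}(r) = \exp \left\{ \frac{w}{\phi }\left( \kappa _p(\theta +r\phi )-\kappa _p(\theta )\right) \right\} = \exp \left\{ w\,\frac{\kappa _p(\theta )}{\phi }\left[ \left( \frac{-\theta /\phi }{-\theta /\phi -r}\right) ^{\frac{2-p}{p-1}}-1\right] \right\} \qquad \text {for } r<-\theta /\phi . \end{aligned}$$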
Note that this is a CPG model in a different parametrization; we call the model under this EDF parametrization Tweedie’s CP model. The following proposition follows by comparing the corresponding moment generating functions.
Proposition 2.1
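Choose \(S \sim \mathrm{CPG}(\lambda w, \gamma , c)\) and \(Y\sim \mathrm{Tweedie}(\theta , w, \phi , p)\). We have identity in distribution \(S/w{\mathop =\limits ^\mathrm{(d)}}Y\) under parameter identification
$$\begin{aligned} p = \frac{\gamma +2}{\gamma +1}, \qquad \phi = \frac{-\theta }{c}, \qquad \lambda = \frac{\kappa _p(\theta )}{\phi }. \end{aligned}$$
Formula (2.8) can be rewritten in different ways. We have, using the canonical link of Tweedie’s CP model, \(\theta =(\kappa _p')^{-1}(\mu ) = \mu ^{1-p}/(1-p)\) and \(\kappa _p(\theta )=\kappa _p ((\kappa _p')^{-1}(\mu ))=\mu ^{2-p}/{(2-p)}\). This implies, using (2.7) in the second step and (2.6) in the last step,
$$\begin{aligned} \lambda = \frac{1}{\phi }\,\frac{\mu ^{2-p}}{2-p} = c\,\mu \,\frac{p-1}{2-p} = \frac{\mu \, c}{\gamma }. \end{aligned}$$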
The latter says that, of course, the expected claim frequency \(\lambda\) is obtained by dividing the expected total claim amount \({{\mathbb {E}}}[Y]=\mu\) by the average claim size \({{\mathbb {E}}}[Z_1]=\gamma /c\).
Thus, under parameter identification scheme (2.6)–(2.8) the two models are identical:
This illustrates that there is a one-to-one correspondence between the CPG parametrization and Tweedie’s CP parametrization, i.e. the two models are identical and only differ in interpretation of parameters. The next section will demonstrate that these subtle differences can be crucial for GLM regression modeling, and resulting models can be rather different as functions of explanatory covariates, see Sect. 3.3 below.
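The one-to-one correspondence can be sketched as a pair of conversion functions. This is a minimal illustration, not the authors’ code: it assumes the identification \(p=(\gamma +2)/(\gamma +1)\), \(\phi =-\theta /c\) and \(\lambda =\mu c/\gamma\), as used elsewhere in this paper, and the function names and numerical values are hypothetical.

```python
def cpg_to_tweedie(lam, gamma, c):
    """Map CPG parameters (lam, gamma, c) to Tweedie's (mu, phi, p)."""
    p = (gamma + 2.0) / (gamma + 1.0)
    mu = lam * gamma / c                      # expected frequency times average claim size
    theta = mu ** (1.0 - p) / (1.0 - p)       # canonical link, theta < 0
    phi = -theta / c
    return mu, phi, p

def tweedie_to_cpg(mu, phi, p):
    """Inverse map from Tweedie's (mu, phi, p) back to (lam, gamma, c)."""
    gamma = (2.0 - p) / (p - 1.0)
    theta = mu ** (1.0 - p) / (1.0 - p)
    c = -theta / phi
    lam = mu * c / gamma
    return lam, gamma, c

lam, gamma, c, w = 0.1, 1.5, 0.002, 2.0       # illustrative values only
mu, phi, p = cpg_to_tweedie(lam, gamma, c)
back = tweedie_to_cpg(mu, phi, p)

# Variance of S / w computed in both parametrizations.
var_cpg = lam * w * (gamma + gamma ** 2) / c ** 2 / w ** 2   # w^-2 E[N] E[Z^2]
var_tweedie = phi / w * mu ** p                              # (phi / w) V(mu)
```

Both variance expressions coincide, and `back` reproduces the original triple up to rounding, which reflects that the two parametrizations describe the same distribution.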
Generalized linear models and parameter estimation
In this section we study multiple insurance policies \(i=1,\ldots , n\) having claim distributions \(\mathrm{CPG}(\lambda _i w_i, \gamma , c_i)\) and \(\mathrm{Tweedie}(\theta _i, w_i, \phi _i, p)\), respectively. We allow for heterogeneity between the policies in all parameters that have a lower index i. We describe modeling and parameter estimation within GLMs: we consider two GLMs to model \(\lambda _i\) (Poisson) and \(c_i/\gamma\) (gamma) in the former case, and we consider a DGLM to model \(\theta _i\) and \(\phi _i\) in the latter case. There is a slight difference between “two GLMs” and “double GLM”: the former considers two independent GLMs, whereas the latter considers two GLMs simultaneously. The volumes \(w_i\) are assumed to be known and do not need any modeling. The shape parameter \(\gamma >0\) and the power variance parameter \(p=(\gamma +2)/(\gamma +1)\), see (2.6), are assumed to be the same for all policies i; this is a standard assumption in state-of-the-art use of these GLMs. An overview of GLMs and their parameter estimation within the EDF is given in “Appendix A”.
Compound Poisson model with i.i.d. gamma claim sizes
We begin with the CPG model. Since the log-likelihood function of the CPG model decouples into two separate parts for claim counts and claim sizes, maximum likelihood estimation (MLE) of claim counts and claim size models can be done independently from each other. We start from n independent random variables \(S_i\sim \mathrm{CPG}(\lambda _iw_i,\gamma ,c_i)\) with
The joint log-likelihood function of this model, given observations \((N_i)_i\) and \((Z_{i,j})_{i,j}\) and weights \((w_i)_i\), is given by
where the term on the second line is zero for \(N_i=0\). Remark that in this log-likelihood function (for parameter estimation) we treat \((N_i)_i\) and \((Z_{i,j})_{i,j}\) as known observations; for notational convenience we do not use small fonts for observations. From (3.1) we now see that we can estimate the Poisson parameters \(\lambda _i\) and the gamma parameters \(\gamma\) and \(c_i\) independently from each other; the former uses observations \((N_i)_i\) and the latter observations \((N_i)_i\) and \((Z_{i,j})_{i,j}\).
Furthermore, we assume that each insurance policy \(i=1,\ldots , n\) is established with covariate information \({\varvec{x}}_i =(x_{i,0},\ldots , x_{i,d})'\in {{\mathcal {X}}} \subset \{1\}\times {{\mathbb {R}}}^{d}\), having initial component \(x_{i,0}= 1\) for modeling the intercept component.
GLM for claim counts: Assume that the expected frequencies \(\lambda _i=\lambda ({\varvec{x}}_i)\) of policies \(i=1,\ldots , n\) can be modeled by a log-linear regression function
with regression parameter \(\varvec{\beta }=(\beta _0,\ldots , \beta _d)'\in {{\mathbb {R}}}^{d+1}\). Assuming that the design matrix \({\mathfrak {X}}=({\varvec{x}}_1,\ldots , {\varvec{x}}_n)'\in {{\mathbb {R}}}^{n\times (d+1)}\) has full rank \(d+1\) we find the unique MLE \(\widehat{\varvec{\beta }}\) for \(\varvec{\beta }\) by the (unique) solution of
Remark that the Poisson distribution has an EDF representation with cumulant function \(\kappa (\cdot )=\kappa _1(\cdot )=\exp \{\cdot \}\). The lower index \(p=1\) in the cumulant function \(\kappa _1(\cdot )\) indicates that we have variance function \(V(\lambda )=\lambda\) in the Poisson case, see also (2.5). The choice (3.2) corresponds to the canonical link \((\kappa _1')^{-1}(\cdot )=\log (\cdot )\) in the Poisson GLM. The choice of the canonical link implies that we obtain an unbiased portfolio estimate, see [21]. The score Eq. (3.3) is solved numerically, for details see (A.3) in “Appendix A”.
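The numerical solution of the Poisson score equations and the resulting unbiased portfolio estimate can be illustrated on a toy portfolio. The sketch below is a hedged example, not the procedure of “Appendix A”: it assumes a single covariate plus intercept, uses plain Newton steps (which coincide with Fisher scoring under the canonical link), and all data and names are hypothetical.

```python
import math

def fit_poisson_glm(x, n_claims, w, iters=50):
    """Newton/Fisher scoring for log(lambda_i) = b0 + b1 * x_i with exposures w_i."""
    b0, b1 = math.log(sum(n_claims) / sum(w)), 0.0   # start at the intercept-only MLE
    for _ in range(iters):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for xi, ni, wi in zip(x, n_claims, w):
            m = wi * math.exp(b0 + b1 * xi)   # expected claim count of policy i
            r = ni - m                        # score contribution (canonical link)
            s0 += r
            s1 += r * xi
            h00 += m
            h01 += m * xi
            h11 += m * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * s0 - h01 * s1) / det      # solve the 2x2 system H d = s
        d1 = (h00 * s1 - h01 * s0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1

# Hypothetical toy portfolio: one covariate, claim counts and exposures.
x = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
n_claims = [1, 0, 2, 1, 3, 4]
w = [1.0, 0.5, 1.0, 0.8, 1.0, 1.2]

b0, b1 = fit_poisson_glm(x, n_claims, w)
fitted = [wi * math.exp(b0 + b1 * xi) for xi, wi in zip(x, w)]
```

Because the intercept column is part of the design matrix, the first score equation forces the fitted counts in `fitted` to sum to the observed number of claims, which is the portfolio-level unbiasedness mentioned above.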
GLM for gamma claim sizes: Consider only insurance policies i which have claims, i.e. with \(N_i>0\). All subsequent considerations in this paragraph are conditional on \(N_i\). The average claim amount on policy i has a conditional gamma distribution
with shape parameter \(\gamma N_i\) and scale parameter \(c_i N_i\) (note that \(\gamma\) is not policy i dependent). This gamma distributed random variable has conditional mean and variance given by
This model belongs to the EDF (2.2) with cumulant function \(\kappa _2(\theta )=-\log (-\theta )\) for \(\theta \in \varvec{\Theta } ={{\mathbb {R}}}_-\), dispersion parameter \(\phi =1/\gamma\) and exposure \(w_i=N_i\). The conditional mean and variance are
This is the boundary case \(p=2\) in Tweedie’s CP model with power variance function \(V(\zeta )=\zeta ^2\), see (2.5).
We set up a second GLM for gamma claim size modeling. This second GLM does not necessarily need to rely on the same covariate space \({{\mathcal {X}}}\) as the Poisson GLM (3.2) for claim counts modeling. To emphasize this point, we introduce a new covariate space containing covariate information \({\varvec{z}}_i =(z_{i,0},\ldots , z_{i,q})'\in {{\mathcal {Z}}} \subset \{1\}\times {{\mathbb {R}}}^{q}\) having initial component \(z_{i,0}= 1\) modeling the intercept. We interpret the choices \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) as follows: both covariates \({\varvec{x}}_i \in {{\mathcal {X}}}\) and \({\varvec{z}}_i \in {{\mathcal {Z}}}\) should belong to the same insurance policy i, however, inclusion of individual covariate components and preprocessing of these components may differ in the two different regression models. This is a result of aiming at optimizing the predictive performance of both regression models.
We make the following regression assumption: choose a suitable link function \(g_2(\cdot )\) to receive the linear predictor, see also “Appendix A”,
for regression parameter \(\varvec{\alpha } \in {{\mathbb {R}}}^{q+1}\). Formula (3.5) explains the relationship between mean \(\zeta ={{\mathbb {E}}}[ {\bar{Z}} | N]=\kappa _2'(\theta )\), canonical parameter \(\theta\) and linear predictor \(\eta =\eta ({\varvec{z}})\). Usually, one does not select the canonical link in the gamma GLM because the negativity constraint on the canonical parameter \(\theta \in \varvec{\Theta } ={{\mathbb {R}}}_-\) may be too restrictive in choosing a linear functional regression form; this is in contrast to the Poisson GLM (3.2). Therefore, the choice of the link function \(g_2(\cdot )\) has to be made carefully, because we require \(1/\theta _i = -g_2^{-1}(\eta _i)=-g_2^{-1}\left\langle \varvec{\alpha }, {\varvec{z}}_i \right\rangle <0\) for all policies \(i=1,\ldots , n\), otherwise the canonical parameter \(\theta _i\) is not in the effective domain \(\varvec{\Theta }\). Below, we will choose the log-link for \(g_2\), which is a common choice for gamma GLMs.
These choices imply for the log-likelihood function, only considering policies \(i=1,\ldots , m\) with \(N_i>0\),
The MLE \(\widehat{\varvec{\alpha }}\) of \(\varvec{\alpha }\) is found by solving the score equation, see “Appendix A”,
with design matrix \({\mathfrak {Z}}=({\varvec{z}}_1,\ldots , {\varvec{z}}_m)'\in {{\mathbb {R}}}^{m\times (q+1)}\), diagonal working weight matrix (using \(V(\zeta _i)= \zeta _i^{2}\))
and with working residual vector \({\varvec{R}}= (\frac{\partial g_2(\zeta _i)}{\partial \zeta _i}({\bar{Z}}_i -\zeta _i))_{i=1,\ldots , m}\).
Remarks 3.1

Shape parameter \(\gamma\) may be treated as a hyperparameter, and the explicit choice of \(\gamma\) does not influence parameter estimation because it cancels in the score Eq. (3.7).

MLE (3.6)–(3.7) is expressed in sufficient statistics \({\bar{Z}}_i\), and we obtain the same regression parameter estimate \(\widehat{\varvec{\alpha }}\) if we perform MLE directly on the individual claim sizes \(Z_{i,j}\). This is an important property, namely, the gamma GLM can be fit solely on the number of claims \(N_i\) and the total claim amount \({\bar{Z}}_i\) on each policy i. Moreover, this estimated model still allows us to simulate individual claim sizes \(Z_{i,j}\). Thus, GLM regression parameter estimation does not differ whether we consider total claim amounts \({\bar{Z}}_i\) or individual claim sizes \(Z_{i,j}\). On the other hand, the process of model and variable selection might give different results in the two estimation cases (\({\bar{Z}}_i\) vs. \(Z_{i,j}\)) because the log-likelihood functions and the estimates for \(\gamma\) differ; this is, e.g., important for model selection using likelihood ratio tests or Akaike’s information criterion, see Remarks 3.10, below.

If we model claim counts and claim sizes separately, we use maximal available information \(N_i\) and \(Z_{i,j}\). Moreover, we can design covariate spaces \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) in an optimal way, and independently from each other.

If (3.5) is not based on the canonical link of the gamma model, the balance property will not be fulfilled, see [22]. This should be corrected by shifting the intercept parameter \(\beta _0\) correspondingly. Often one chooses the log-link for \(g_2(\cdot )\); under the log-link choice we can also reformulate the regression problem by replacing the average claim amount response (3.4) by the (conditional) total claim amount \(S_i|_{\{N_i\}}\) and treating \(\log (N_i)\) as a known offset in the linear predictor.

Shape parameter \(\gamma <1\) leads to an overdispersed model with strictly decreasing density, and for \(\gamma >1\) the density is unimodal. Above \(\gamma\) is treated as a hyperparameter, and below we discuss MLE of \(\gamma\).

If shape parameter \(\gamma _i\) needs explicit modeling as a function of i, then (3.7)–(3.8) will no longer have such a simple structure, and MLE of \(\varvec{\alpha }\) will depend on the explicit choices of \(\gamma _i\). In this case, one can either use a gamma DGLM or one can rely on the 2-dimensional exponential family. The latter model is less tractable numerically. It considers cumulant function \(\kappa (\theta _1,\theta _2) = \log \Gamma (\theta _2)-\theta _2\log (-\theta _1)\) for scale parameter \(c=-\theta _1>0\) and shape parameter \(\gamma =\theta _2>0\). This gives inverse link function, see [21],
$$\begin{aligned} \nabla _{(\theta _1,\theta _2)} \kappa (\theta _1,\theta _2)=\left( \frac{\theta _2}{-\theta _1}, \frac{\Gamma '(\theta _2)}{\Gamma (\theta _2)}-\log (-\theta _1)\right) ', \end{aligned}$$
the first component being the mean of the gamma distributed random variable Z, and the second component being the mean of \(\log (Z)\). We do not further follow up this approach because we would lose the connection to Tweedie’s CP approach with a policy independent power variance parameter, see next section.
There remains estimation of shape parameter \(\gamma\) for given MLE \(\widehat{\varvec{\alpha }}\). One could either use Pearson’s dispersion estimate for \(1/\gamma\) or directly calculate the MLE of \(\gamma\). In view of (3.6), the MLE is obtained from score equation \(\frac{\partial }{\partial \gamma } \ell (\widehat{\varvec{\alpha }}, \gamma ) =0\), which yields
where we set \({\widehat{\theta }}_i= -\exp \{-\langle \widehat{\varvec{\alpha }},{\varvec{z}}_i\rangle \}\). Either we solve this score equation numerically using the Newton-Raphson algorithm, or we plot the one-dimensional log-likelihood function \(\gamma \mapsto \ell (\widehat{\varvec{\alpha }},\gamma )\) and determine the MLE \({\widehat{\gamma }}\) from this plot, see Fig. 1, below, for an example.
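The second route, reading \({\widehat{\gamma }}\) off the one-dimensional log-likelihood, can be sketched as a grid search. The example below is an illustration under stated assumptions: it uses the conditional gamma distribution of \({\bar{Z}}_i\) given \(N_i\) with shape \(\gamma N_i\) and rate \(\gamma N_i/\zeta _i\), treats the fitted means \({\widehat{\zeta }}_i\) as given, and all data values and names are hypothetical.

```python
import math

def gamma_loglik(shape, n_claims, z_bar, zeta):
    """Log-likelihood of average claim sizes as a function of the shape parameter."""
    total = 0.0
    for n_i, z_i, zeta_i in zip(n_claims, z_bar, zeta):
        a = shape * n_i        # conditional shape gamma * N_i
        rate = a / zeta_i      # conditional rate c_i * N_i with c_i = gamma / zeta_i
        total += (a * math.log(rate) - math.lgamma(a)
                  + (a - 1.0) * math.log(z_i) - rate * z_i)
    return total

# Hypothetical policies with claims: counts, average claim sizes, fitted means.
n_claims = [1, 2, 1, 3, 1]
z_bar = [800.0, 1500.0, 400.0, 1200.0, 2500.0]
zeta_hat = [1000.0, 1200.0, 900.0, 1100.0, 1300.0]

# Evaluate the profile on a grid and pick the maximizer (a stand-in for reading a plot).
grid = [0.05 + 0.01 * k for k in range(2000)]
gamma_hat = max(grid, key=lambda g: gamma_loglik(g, n_claims, z_bar, zeta_hat))
```

A Newton-Raphson iteration on the score would refine `gamma_hat` further; the grid is used here only because it needs no digamma or trigamma implementation.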
We conclude by calculating Fisher’s information matrix for \((\varvec{\alpha },\gamma )\) in our gamma GLM. We have, see “Appendix A”,
For the second derivative of the \(\gamma\) term we have
where the second order derivative \(\psi '(x)=\frac{d^2}{dx^2}\log \Gamma (x)\) of the log-gamma function is known as the trigamma function, see [10, Sec. 5.15]. The trigamma function is directly available in the statistical software R [13]. For the off-diagonal terms we have
This gives us the following Fisher’s information matrix for the gamma claim size modeling
Tweedie’s compound Poisson generalized linear model
Homogeneous dispersion case
From Sect. 2.2 we know that Tweedie’s CP model belongs to the EDF; thus, setting up a GLM is straightforward. In this subsection we start with the homogeneous dispersion parameter \(\phi >0\) case; this case will not be supported in Remarks 3.2, below. We assume having n independent random variables \(Y_i \sim \mathrm{Tweedie}(\theta _i, w_i, \phi , p)\), and we choose hyperparameter \(p=(\gamma +2)/(\gamma +1) \in (1,2)\) to make Tweedie’s CP model consistent with the CPG case, see Proposition 2.1. Choosing a suitable link function \(g_p(\cdot )\) we make the following regression assumption for the linear predictor
where \({\varvec{x}}^*_i \in {{\mathcal {X}}}^*\subset \{1\}\times {{\mathbb {R}}}^{d^*}\) are the covariates of policy i and \(\varvec{\beta }^*\) is the regression parameter. We change the covariate notation compared to Sect. 3.1 because covariate preprocessing might be done differently for Tweedie’s CP model compared to the CPG case (because we consider different responses). In complete analogy with the above, MLE requires solving the score equations
with design matrix \({\mathfrak {X}}=({\varvec{x}}^*_1,\ldots , {\varvec{x}}^*_n)'\), diagonal working weight matrix (using \(V(\mu _i)=\mu _i^{p}\))
and working residual vector \({\varvec{R}}= (\frac{\partial g_p(\mu _i)}{\partial \mu _i}(Y_i -\mu _i))_{i=1,\ldots , n}\).
Remarks 3.2
There are a couple of crucial differences between Tweedie’s CP approach with homogeneous dispersion \(\phi\) and the CPG approach of the previous section:

1. The CPG approach of the previous section uses all available information of claim counts \(N_i\) and claim sizes \({\bar{Z}}_{i}\), whereas Tweedie’s CP approach with homogeneous dispersion parameter only uses total claim cost information \(Y_i\).

2. The former approach allows us to consider different covariate spaces \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) for claim counts and claim size modeling, whereas the latter approach only relies on one version of the covariate space \({{\mathcal {X}}}^*\).

3. The mean estimates \({\widehat{\lambda }}_i{\widehat{\zeta }}_i\) in the CPG case do not rely on the particular choice of the shape parameter \(\gamma\), whereas in the homogeneous dispersion Tweedie’s CP approach the mean estimates \({\widehat{\mu }}_i\) rely on the specific choice of power variance parameter \(p=(\gamma +2)/(\gamma +1)\) through the working weight matrix \(W_p\), see (3.12).

4. In general, the dispersions resulting from \(\mathrm{CPG}(\lambda _iw_i,\gamma ,c_i)\) are not constant:
$$\begin{aligned} \mathrm{Var}(S_i/w_i)&= w_i^{-2}{{\mathbb {E}}}[N_i]{{\mathbb {E}}}[Z_{i,1}^2]~=~w_i^{-1}\lambda _i\left( \frac{\gamma }{c^2_i}+ \frac{\gamma ^2}{c_i^2}\right) ~=~w_i^{-1}\lambda _i\zeta _i~\frac{1+\gamma }{c_i}\\&= w_i^{-1}\mu _i^p~\frac{\mu _i^{1-p}}{c_i (p-1)} =w_i^{-1}\left( \frac{-\theta _i}{c_i}\right) \mu _i^p =\frac{\phi _i}{w_i}~\mu _i^p. \end{aligned}$$
The dispersion can only be constant if \(\phi _i=-\theta _i/c_i\) does not depend on i. Typically, this is not the case, see also Conclusions and Remarks 3.9, below. Therefore, we need to extend the homogeneous dispersion case of Tweedie’s CP model to a DGLM Tweedie’s CP model, otherwise it cannot be compared to the CPG case, which is more flexible in dispersion modeling. For more analysis of the homogeneous dispersion case see [12].
Heterogeneous dispersion case
As stated in Remarks 3.2, the homogeneous dispersion Tweedie’s CP approach does not use full information of claim counts and claim costs and it does not allow for flexible dispersion modeling \(\phi _i\). In Section 2 of [17], the authors raise the point that in applications of Tweedie’s CP model to insurance claim data it is important to use full information so that also the dispersion parameter \(\phi _i\) is modeled flexibly. As a consequence, the dispersion parameter cannot be factored out as in (3.12), and it does not cancel in optimization (3.11). Therefore, [17] propose to use the framework of DGLMs which was introduced and developed by [8, 15, 18]. DGLMs allow for simultaneous modeling of both mean and dispersion parameters by using a second GLM for the dispersion parameter \(\phi _i\). The two GLMs are jointly calibrated using claim count and claim cost information. The joint density of a single case (N, Y) has been derived in formula (11) of [6]:
with \(p=(\gamma +2)/(\gamma +1)\), \(\kappa _p(\cdot )\) given in (2.3), and
If we reparametrize this joint distribution using mean parameter \(\mu =\kappa _p'(\theta )=((1-p)\theta )^{1/(1-p)}\) for total claim costs we arrive at the log-likelihood function
In complete analogy with the above we determine the score equations w.r.t. \(\mu\) and \(\phi\)
with variance function \(V(\mu )=\mu ^p\).
Proposition 3.3
Fisher’s information contribution in the heterogeneous dispersion Tweedie’s CP model w.r.t. \((\mu , \phi )\) is given by
Moreover, we have
Remark that in the above proposition we talk about Fisher’s information contribution because the statement considers only one single random variable (Y, N). This is in contrast to (3.27) where we calculate Fisher’s information matrix over the entire portfolio.
Joint MLE of \(\mu\) and \(\phi\) requires solving score Eqs. (3.14)–(3.15). This can be done by any suitable root search or gradient descent algorithm. In [17], this root search problem is approached using a slightly different representation, namely, by introducing a dispersion response variable D. This allows for a reformulation of the model in a DGLM form. We revisit [17] after proving Proposition 3.3.
Proof of Proposition 3.3
We start by calculating the means of the terms of the score in (3.15). We have
and for the second term we obtain
From these two formulas it follows that, indeed, the score in (3.15) is a residual with mean zero. The cross-covariance terms are easily obtained by noting that also the score in (3.14) is a zero mean residual. This implies
There remain the diagonal terms. For the first one we have, using integration by parts,
For the second diagonal term we have, this provides the variance of the zero mean score in (3.14),
This finishes the proof of Proposition 3.3. \(\square\)
Thus, for MLE of \(\mu\) and \(\phi\) we need to consider the scores in (3.14)–(3.15), the latter one defining (unscaled) residuals w.r.t. the dispersion given by
As mentioned above, solving score Eqs. (3.14)–(3.15) produces the MLEs for \(\mu\) and \(\phi\); basically, this finishes the MLE problem. In the remainder of this section, following [17], we rewrite this MLE problem. This different representation introduces a new (dispersion) response variable D, such that the root search problem can directly be related to Fisher’s scoring method in a DGLM form. Choose square variance function \(V_d(\phi )=\phi ^2\) and dispersion-prior weights
This allows us to define so-called dispersion responses
having \({{\mathbb {E}}}[D]=\phi\), \(\mathrm{Var}(D)=\frac{2}{v}V_d(\phi )\) and scores w.r.t. \(\phi\)
Fisher’s information contribution (3.16) then reads as
As emphasized by [16], orthogonality of \(\mu\) and \((\phi ,p)\), see (3.17), typically leads to fast convergence in estimation algorithms.
Remarks 3.4

We start from the joint distribution of (N, Y), given in (3.13), for estimating \((\mu , \phi )\). This estimation problem is modified by considering a new response vector (Y, D), instead. The new dispersion response D, defined in (3.19), is not gamma distributed, but in view of score (3.20) we bring it into a gamma EDF structure with weight \(v>0\), dispersion parameter 2 and square variance function \(V_d(\phi )=\phi ^2\), see also (2.5). In [17] it is mentioned that these definitions of v and D are somewhat artificial, but they bring this estimation problem into a DGLM form. Note that this requires including one dispersion term \(\phi\) in the weight v and the response D, which means that we only have an approximate score equation equivalence with a gamma MLE problem. In view of Proposition 3.3, we could also define dispersion response D differently by choosing an inverse Gaussian power variance function, i.e. \(V_d(\phi )=\phi ^3\), and defining the dispersion-prior weight correspondingly. This provides the same numerical solution for MLE, using an approximate score equation equivalence with an inverse Gaussian MLE problem. However, in this latter version the weights do not provide the right scaling for a distribution within the EDF.

Alternatively, we could try to estimate dispersion \(\phi\) using Tweedie’s deviance residuals
$$\begin{aligned} {{\mathcal {E}}} = \mathrm{sgn}(Y-\mu ) \sqrt{2w \left( Y \frac{Y^{1-p}-\mu ^{1-p}}{1-p}-\frac{Y^{2-p}-\mu ^{2-p}}{2-p}\right) }. \end{aligned}$$
Following [17], the squared residuals \({{\mathcal {E}}}^2\) are approximately \(\phi \chi _1^2\) distributed for \(\phi\) sufficiently small, thus, they can be approximated by a gamma distribution with mean \(\phi\) and variance \(2\phi ^2\). Section 3.1 of [17] discusses this estimation approach. We do not further follow these lines because this approach does not use any claim count information and, therefore, does not benefit from full information (N, Y) as the CPG case.

There is a third alternative of including a dispersion estimation, and this third one is the one implemented in the R package dglm. This requires that the dispersion parameter is made policy dependent and then a DGLM is explored on \((Y,{{\mathcal {E}}})\) by alternating the corresponding score updates. Also this approach does not benefit from full information (N, Y) (in contrast to the CPG model), and it is therefore not further explored in this manuscript.
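Tweedie’s deviance residuals from the second remark above can be computed as follows. This is a minimal sketch for \(p\in (1,2)\); the function names are hypothetical, and the case \(Y=0\) is handled through the limit \(Y\cdot Y^{1-p}=Y^{2-p}\rightarrow 0\), which is an implementation choice made here to avoid a division by zero.

```python
import math

def unit_deviance(y, mu, p):
    """Tweedie unit deviance 2 * (y (y^(1-p) - mu^(1-p))/(1-p) - (y^(2-p) - mu^(2-p))/(2-p))."""
    if y == 0.0:
        # Limit y -> 0: only the mu^(2-p) term survives for p in (1, 2).
        return 2.0 * mu ** (2.0 - p) / (2.0 - p)
    a = y * (y ** (1.0 - p) - mu ** (1.0 - p)) / (1.0 - p)
    b = (y ** (2.0 - p) - mu ** (2.0 - p)) / (2.0 - p)
    return 2.0 * (a - b)

def deviance_residual(y, mu, w, p):
    """Signed square root of the exposure weighted unit deviance."""
    d = max(w * unit_deviance(y, mu, p), 0.0)   # guard against tiny negative rounding
    sign = (y > mu) - (y < mu)
    return sign * math.sqrt(d)

p, mu, w = 1.5, 3.0, 2.0                        # illustrative values only
res_zero = deviance_residual(0.0, mu, w, p)     # policy without claims
res_equal = deviance_residual(3.0, mu, w, p)    # observation equal to the mean
res_large = deviance_residual(5.0, mu, w, p)    # observation above the mean
```

The residual vanishes when the observation equals the mean and takes the sign of \(Y-\mu\) otherwise; note that it depends on the total claim cost only, not on the claim count.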
Double generalized linear model in the heterogeneous Tweedie case
We use the heterogeneous dispersion Tweedie’s CP approach and bring it into a DGLM form as described in the previous section. Choosing a suitable link function \(g_p(\cdot )\) we make the following regression assumption for the linear predictor of the mean
upper indices \(^*\) distinguishing the parametrization in Tweedie’s CP GLM case from the individual models in Sect. 3.1. For the modeling of the dispersion parameter we choose a second link function \(g_d(\cdot )\) such that we have the linear predictor
where the covariates \({\varvec{z}}^*_i \in {{\mathcal {Z}}}^*\subset \{1\}\times {{\mathbb {R}}}^{q^*}\) are potentially differently preprocessed than the ones \({\varvec{x}}^*_i \in {{\mathcal {X}}}^*\subset \{1\}\times {{\mathbb {R}}}^{d^*}\), but still belong to the same policy i. MLE of \((\varvec{\beta }^*,\varvec{\alpha }^*)\) requires solving the score equations, see (3.14) and (3.20),
with design matrices \({\mathfrak {X}}=({\varvec{x}}^*_1,\ldots , {\varvec{x}}^*_n)'\) and \({\mathfrak {Z}}=({\varvec{z}}^*_1,\ldots , {\varvec{z}}^*_n)'\), working weight matrices
and working residual vectors \({\varvec{R}}= (\frac{\partial g_p(\mu _i)}{\partial \mu _i}(Y_i-\mu _i))_{i=1,\ldots , n}\) and \({\varvec{R}}_d= (\frac{\partial g_d(\phi _i)}{\partial \phi _i}(D_i-\phi _i))_{i=1,\ldots , n}\). For the definition of the dispersion-prior weights \(v_i=v_i(\phi _i)\) and the dispersion responses \(D_i\) we refer to (3.18)–(3.19). Using Fisher’s scoring method for estimating \(\varvec{\beta }^*\) and \(\varvec{\alpha }^*\), see “Appendix A”, we explore the scoring updates
where all terms on the right-hand side are evaluated at algorithmic time t, that is, \(W_p=W_p(\varvec{\beta }^*_t, \varvec{\alpha }^*_t)\), \(W_d=W_d(\varvec{\alpha }^*_t)\), \({\varvec{R}}={\varvec{R}}(\varvec{\beta }^*_t)\), \({\varvec{R}}_d={\varvec{R}}_d(\varvec{\beta }^*_t, \varvec{\alpha }^*_t)\), \(g_p(\varvec{\mu })=g_p(\varvec{\mu }(\varvec{\beta }^*_t))\) and \(g_d(\varvec{\phi })=g_d(\varvec{\phi }(\varvec{\alpha }^*_t))\). This also indicates how the two sets of parameters interact. Since parameters \(\varvec{\beta }^*\) and \(\varvec{\alpha }^*\) are orthogonal, alternating the updates leads to fast convergence. Standard errors are obtained from the inverse of Fisher’s information matrix
There remains estimation of p. This is usually done by considering the profile log-likelihood for p, given optimal estimates of \((\varvec{\beta }^*, \varvec{\alpha }^*)\), that is, we study \(p\mapsto \ell (\widehat{\varvec{\beta }}^*(p), \widehat{\varvec{\alpha }}^*(p),p)\) where, in general, the MLEs \(\widehat{\varvec{\beta }}^*(p)\) and \(\widehat{\varvec{\alpha }}^*(p)\) depend on the explicit choice of the power variance parameter p; for an example of a profile log-likelihood we refer to Fig. 1, below.
Remarks 3.5

We emphasize that covariates may be chosen and preprocessed differently in the CPG and in Tweedie’s CP models; this is indicated by choosing different notation for the covariate spaces \(({{\mathcal {X}}}, {{\mathcal {Z}}})\) and \(({{\mathcal {X}}}^*, {{\mathcal {Z}}}^*)\), respectively. Different preprocessing of covariates might be necessary because we aim at optimally modeling different responses in the two models. This optimal modeling also includes good choices of link functions which may even imply that a CPG GLM does not lead to a Tweedie CP DGLM counterpart (or vice versa) because the linear predictor structure does not necessarily carry through general choices of link functions. In Sect. 3.3 we fully rely on log-links which allow for a one-to-one identification scheme between the different GLM frameworks.

The calculation of the terms of Fisher’s information matrix involving p is somewhat cumbersome; for this reason we do not give them explicitly.

As usual in MLE, typically, the dispersion parameters \(\phi _i\) will be underestimated because MLE is not unbiased for variance parameter estimation, we refer to [17], Sects. 3.2 and 4.3. Using both total claim costs Y and claim counts N, the bias is often small, see [17].
We close this subsection by considering the special case of log-links for \(g_p\) and \(g_d\). This special choice provides working weight matrices \(W_p\) and \(W_d\)
and working residual vectors \({\varvec{R}}= ((Y_i/\mu _i-1))_{i=1,\ldots , n}\) and \({\varvec{R}}_d= ((D_i/\phi _i-1))_{i=1,\ldots , n}\). This provides us with score equations
thus, in both cases we can use the same working weight matrix \(W_p\).
Theorem 3.6
Assume Tweedie’s CP DGLM holds with covariate spaces \({{\mathcal {X}}}^*={{\mathcal {Z}}}^*\) and covariate choices \({\varvec{x}}^*_i={\varvec{z}}^*_i\) for all insurance policies \(i=1,\ldots , n\). Moreover, assume that for both GLMs we choose log-links for \(g_p\) and \(g_d\). The MLE \(\widehat{\varvec{\beta }}^*\) of \(\varvec{\beta }^*\) does not depend on the explicit choice of the power variance parameter \(p\in (1,2)\), and also the corresponding mean estimates \({\widehat{\mu }}_i=\exp \langle \widehat{\varvec{\beta }}^*,{\varvec{x}}^*_i\rangle\) are p-independent. Assume that \({\widehat{\mu }}_i\) and \({\widehat{\phi }}_i(p)\) solve the score Eqs. (3.28)–(3.29) for power variance parameter \(p \in (1,2)\). The dispersion parameter estimates scale as a function of power variance parameters \(q \in (1,2)\) as
Remarks 3.7

Theorem 3.6 is a very useful and strong result. In general, we have to run Fisher’s scoring method for every power variance parameter \(p\in (1,2)\) to find optimal MLEs \(\widehat{\varvec{\beta }}^*(p)\) and \(\widehat{\varvec{\alpha }}^*(p)\). In a second step, the optimal power variance parameter is found by considering the profile log-likelihood in p. Under the assumptions of Theorem 3.6 we only need to run Fisher’s scoring method once to receive MLEs \(\widehat{\varvec{\beta }}^*\) and \(\widehat{\varvec{\alpha }}^*(p)\) for a fixed power variance parameter p. All dispersion estimates for different power variance parameters are then directly obtained from Theorem 3.6, and mean parameter estimates do not vary in p. That is, we can directly maximize function \(q \mapsto \ell ({\widehat{\mu }}_i, {\widehat{\phi }}_i(q),q)\) where the dispersion \({\widehat{\phi }}_i(q)\) scales in q according to Theorem 3.6.

Theorem 3.6 also highlights that the heterogeneous dispersion case is fundamentally different from the homogeneous one. The mean estimates in the homogeneous case depend on the choice of the power variance parameter p through the working weight matrix \(W_p\) in (3.12). In contrast to the heterogeneous dispersion case, a constant dispersion parameter does not leave any room to balance different p’s through portfolio varying dispersions. On the other hand, under the assumptions of Theorem 3.6, the mean estimates are not p-sensitive, just as in the CPG case.
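The practical use of Theorem 3.6 can be sketched in a few lines. The proof below works with the form \({\widetilde{\phi }}_i = k \phi _i \mu _i^{p-q}\). The sketch assumes the constant \(k=(2-p)/(2-q)\), which follows if one requires the implied Poisson frequency \(\mu ^{2-p}/((2-p)\phi )\) of Tweedie's CP model to be unchanged in p. This constant is an assumption of the sketch, not a quotation of the theorem's displayed formula, and all function names and numerical values are hypothetical.

```python
def rescale_dispersion(phi_p, mu, p, q):
    """Map a dispersion fitted under p to power variance parameter q.

    Uses phi_q = k * phi_p * mu^(p - q) with the ASSUMED constant
    k = (2 - p) / (2 - q).
    """
    k = (2.0 - p) / (2.0 - q)
    return k * phi_p * mu ** (p - q)


def implied_frequency(mu, phi, p):
    """Poisson frequency implied by (mu, phi, p) in Tweedie's CP model."""
    return mu ** (2.0 - p) / ((2.0 - p) * phi)


# Hypothetical fitted values for one policy, obtained for p = 1.39.
mu, phi_p, p = 0.8, 25.0, 1.39
for q in (1.36, 1.40, 1.41):
    phi_q = rescale_dispersion(phi_p, mu, p, q)
    # The implied frequency is identical for every q; the mean is untouched.
    print(q, phi_q, implied_frequency(mu, phi_q, q))
```

Only the dispersions are recomputed when moving along the grid of q values, so a single run of Fisher's scoring method suffices for the whole profile.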
Proof of Theorem 3.6
Under log-link choices, the score equations for \(\varvec{\beta }^*\) and \(\varvec{\alpha }^*\) are given by, see (3.14)–(3.15),
Assume that \(\mu _i=\mu _i(p)\) and \(\phi _i =\phi _i(p)\) solve the above score equations for given power variance parameter \(p \in (1,2)\). Next, we choose power variance parameter \(q\ne p\), and define \({\widetilde{\phi }}_i = k \phi _i \mu _i^{p-q}\) for some \(k>0\). We plug \(\mu _i\) and \({\widetilde{\phi }}_i\) into the first score equation for power variance parameter q
thus, the pairs \((\mu _i,{\widetilde{\phi }}_i)\) fulfill the first score equation. We now need to massage these pairs through the second score equation for power variance parameter q
Next we apply that the pairs \((\mu _i,\phi _i)\) solve the score equations for p. This provides for the score function of \(\varvec{\alpha }^*\)
Now we still have one parameter \(k>0\) that we can choose. We require
This choice implies that (3.30) is equal to zero which follows from the fact that the pairs \((\mu _i,\phi _i)\) solve the score equations for \(\varvec{\beta }^*=\varvec{\beta }^*(p)\). This finishes the proof. In Remarks 3.10 we give a shorter proof. \(\square\)
Relation between the two GLM approaches
We compare the CPG model to its counterpart being parametrized through Tweedie’s CP model. To start off, recall formulas (2.6)–(2.9). The first formula gives relationship \(p=(\gamma +2)/(\gamma +1)\in (1,2)\). Since these two parameters are not modeled insurance policy dependent, we directly identify them. We start with the gamma claim size GLM of Sect. 3.1 using identification (2.7). The means are given by, see (3.5),
where we have used canonical link \(\theta =(\kappa _p')^{-1}(\mu )=-\mu ^{1-p}/(p-1)\). From identification (2.9) we have
From identities (3.31)–(3.32) we conclude that for general link functions it is nontrivial to derive one parametrization from the other, i.e. this requires quite some feature engineering to bring the models in line (if possible at all). If we choose log-links for \(g_2\), \(g_p\) and \(g_d\) (these are not the canonical links in all three cases but they are convenient because they preserve the right sign convention on the canonical scale) we can directly compare the linear predictors
Formulating this differently gives us the following theorem.
Theorem 3.8
Assume all link functions in (3.2), (3.5), (3.21) and (3.22) are chosen to be the log-links. The CPG GLM having constant shape parameter \(\gamma >0\) and Tweedie’s CP DGLM with variance parameter \(p=(\gamma +2)/(\gamma +1)\in (1,2)\) can be identified by (i.e. the resulting two models are equal under) the following equations for the linear predictors
Conclusions and Remarks 3.9

If we have found a good parametrization for the Poisson claim counts GLM and the gamma claim size GLM involving covariates \({\varvec{x}}\in {{\mathcal {X}}}\) and \({\varvec{z}}\in {{\mathcal {Z}}}\), then Tweedie’s CP model should include all components present in \({\varvec{x}}\cup {\varvec{z}}\), and \({\varvec{x}}^*\) and \({\varvec{z}}^*\) should only differ if some components of \({\varvec{x}}\cup {\varvec{z}}\) cancel out by a particular choice of regression parameters \(\varvec{\beta }\) and \(\varvec{\alpha }\). The same holds true if we exchange the roles of the two models.

From the second identity of Theorem 3.8 we see that dispersion \(\phi _i\) is constant over all policies i if and only if
$$\begin{aligned} (p-1)\left\langle \varvec{\beta }^{(0)},{\varvec{x}}_i^{(0)}\right\rangle = (2-p)\left\langle \varvec{\alpha }^{(0)},{\varvec{z}}_i^{(0)}\right\rangle \qquad \text { for all { i},} \end{aligned}$$(3.33)
where the upper indices \(^{(0)}\) indicate that we exclude the intercept components \(x_{i,0}=z_{i,0}=1\) in these scalar products. Identity (3.33) gives the condition under which the assumptions of Sect. 3.2.1 are justified. However, in many practical insurance pricing examples, we find that the covariate space \({{\mathcal {Z}}}\) for claim sizes is strictly smaller than \({{\mathcal {X}}}\) used for claim counts modeling because certain factors only influence claim frequencies but are not significant for claim severities. In addition, often there are covariates that have opposite signs for claim counts and claim sizes. In all these cases (3.33) is not satisfied, and working under a constant dispersion assumption cannot be justified.

We believe that covariate preprocessing is more easily done within the CPG model. The reason being, as stated above, that claim counts and claim sizes often behave differently w.r.t. covariate information. Covariate spaces \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) allow us to explore such differences individually. In Tweedie’s CP model everything is merged together which makes it more difficult to choose good covariates and to separate the different systematic effects.

Tweedie’s CP model calibrated with MLE will typically differ from the corresponding CPG model if we follow Theorem 3.8. The CPG model involves \({{\mathcal {X}}}+{{\mathcal {Z}}}=d+q+2\) parameters. This typically results in a Tweedie CP model with \({{\mathcal {X}}}^*+{{\mathcal {Z}}}^*=2{{\mathcal {X}}}^*\) parameters, which is bigger than \(d+q+2\) if \({{\mathcal {X}}}\ne {{\mathcal {Z}}}\). Thus, in Tweedie’s CP model there are more parameters to be estimated if we follow the above guidance.
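The identification behind Theorem 3.8 and condition (3.33) can be made concrete with a small sketch. It assumes the standard correspondence between a compound Poisson model with frequency \(\lambda\), mean claim size \(\zeta\) and gamma shape \(\gamma\) on the one hand, and Tweedie's CP parameters \((\mu ,\phi ,p)\) on the other; formulas (2.6)–(2.9) are not reproduced in this section, so the code is a sketch under that assumption, with hypothetical function names and input values.

```python
import math


def cpg_to_tweedie(lam, zeta, gamma):
    """(frequency, mean claim size, gamma shape) -> (mu, phi, p)."""
    p = (gamma + 2.0) / (gamma + 1.0)
    mu = lam * zeta
    # On the log scale:
    # log(phi) = (1-p) log(lam) + (2-p) log(zeta) - log(2-p).
    phi = lam ** (1.0 - p) * zeta ** (2.0 - p) / (2.0 - p)
    return mu, phi, p


def tweedie_to_cpg(mu, phi, p):
    """(mu, phi, p) -> (frequency, mean claim size, gamma shape)."""
    gamma = (2.0 - p) / (p - 1.0)
    lam = mu ** (2.0 - p) / ((2.0 - p) * phi)
    zeta = mu / lam
    return lam, zeta, gamma


# Hypothetical policy: 1% claim frequency, mean claim size 20'000.
lam, zeta, gamma = 0.01, 2.0e4, 1.5
mu, phi, p = cpg_to_tweedie(lam, zeta, gamma)

# Both parametrizations give the same variance per unit of exposure.
var_cpg = lam * zeta ** 2 * (gamma + 1.0) / gamma
var_tweedie = phi * mu ** p
```

Under log-links, \(\log \phi _i\) is linear in the two CPG linear predictors with coefficients \(1-p\) and \(2-p\); it is constant in i exactly when the non-intercept parts cancel, which is the content of (3.33).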
We close this section by giving the log-likelihoods of Tweedie’s CP DGLM and of the CPG GLM under log-link choices. The one of Tweedie’s CP DGLM is given by
To make the log-likelihood of the CPG GLM directly comparable to (3.34), we make a change of variables \((N_i,{\bar{Z}}_i)\mapsto (N_i, Y_i)\) by setting \(Y_i= N_i {\bar{Z}}_i /w_i\). This gives us log-likelihood
Assuming covariate relationship \({\varvec{x}}^*_i={\varvec{z}}^*_i\) we can reparametrize the first log-likelihood (3.34) by setting \(\varvec{\beta }^+=(2-p)\varvec{\beta }^* -\varvec{\alpha }^*\) and \(\varvec{\alpha }^+=(p-1)\varvec{\beta }^*+ \varvec{\alpha }^*\), this gives us (we drop irrelevant terms)
This proves under \({\varvec{x}}_i^*={\varvec{z}}_i^*={\varvec{x}}_i\cup {\varvec{z}}_i\) that the CPG model is nested in Tweedie’s CP model and we have for given \(p=(\gamma +2)/(\gamma +1)\)
this explicitly uses that we have the same data representation \((N_i,Y_i)_i\) in both log-likelihoods.
Remarks 3.10

Under the assumptions of Theorem 3.8 and additionally assuming that \({\varvec{x}}_i={\varvec{z}}_i={\varvec{x}}^*_i={\varvec{z}}^*_i\), we receive an identity in (3.36). Since the mean estimates in the CPG case do not depend on the particular choice of the shape parameter \(\gamma\), the same must hold true for Tweedie’s CP DGLM model under identical covariates \({\varvec{x}}_i={\varvec{z}}_i={\varvec{x}}^*_i={\varvec{z}}^*_i\). Using Proposition 2.1 we then receive the dispersion scaling of Theorem 3.6, thus, this gives us a second shorter proof for Theorem 3.6.

If \({\varvec{x}}_i^*={\varvec{z}}_i^*={\varvec{x}}_i\cup {\varvec{z}}_i\) and \({\varvec{x}}_i \ne {\varvec{z}}_i\), the CPG model is strictly nested in Tweedie’s CP model and, in general, we do not get an identity in (3.36). In that case, Theorem 3.8 reflects an ideal world because noise in the data prevents MLE estimated parameters (estimated separately in both models) from strictly satisfying the identities in Theorem 3.8.

To perform model selection in the general case we can use Akaike’s information criterion (AIC) [1]. This corrects both sides of (3.36) by the number of regression parameters involved, thus, with AIC the model with the smaller value should be preferred from either
$$\begin{aligned}&-2\underset{(\varvec{\beta }^*, \varvec{\alpha }^*)}{\max }~ \ell _\mathrm{Tw}(\varvec{\beta }^*, \varvec{\alpha }^*) +2(d^*+ q^*+ 2) \quad \text { or} \nonumber \\&\quad -2\underset{(\varvec{\beta }, \varvec{\alpha })}{\max } ~\ell _\mathrm{CPG}(\varvec{\beta }, \varvec{\alpha }) +2(d+q+2). \end{aligned}$$(3.37)
AIC applies because in both models we use the same data representation \((N_i,Y_i)_i\) and both models are evaluated in the MLEs for the corresponding parameters. We emphasize that for estimating the CPG GLM we use in (3.36) sufficient statistics \(Y_i=N_i{\bar{Z}}_i/w_i\). If, instead, we use the individual claim sizes \(Z_{i,j}\) to estimate the CPG GLM, AIC does not apply because the log-likelihoods to be compared use the available data in different ways.
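The AIC comparison in (3.37) amounts to the following small computation. The maximized log-likelihood values and the covariate dimensions below are hypothetical placeholders, chosen only to show the mechanics; the helper name `aic` is likewise an assumption of this sketch.

```python
def aic(max_loglik, n_params):
    """Akaike's information criterion: -2 * log-likelihood + 2 * #parameters."""
    return -2.0 * max_loglik + 2.0 * n_params


# Hypothetical maximized log-likelihoods on the same data (N_i, Y_i).
loglik_tw, d_star, q_star = -1000.0, 10, 10   # Tweedie's CP DGLM
loglik_cpg, d, q = -1004.0, 6, 4              # CPG GLM

aic_tw = aic(loglik_tw, d_star + q_star + 2)
aic_cpg = aic(loglik_cpg, d + q + 2)

# The model with the smaller AIC value is preferred.
preferred = "CPG" if aic_cpg < aic_tw else "Tweedie CP"
```

In this illustration the larger Tweedie CP model attains the higher log-likelihood, as it must for nested models, but the penalty for its additional parameters outweighs the gain.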
Numerical examples
We study two numerical examples to benchmark the two modeling approaches of Theorem 3.8. First, we design a synthetic data example that fully meets the assumptions of Theorem 3.8. Thus, there is no model uncertainty involved in this first (synthetic) example about underlying distributions, covariate spaces and link functions, and we can fully focus on estimating parameters with MLE in the CPG GLM and in Tweedie’s CP DGLM. These results are then compared to neural network regression approaches on the same synthetic data. In contrast to GLMs, neural networks explore optimal covariate selection themselves. This is done in Sect. 4.2.2. Our second example in Sect. 4.3 is a real data example. This additionally raises the issue of model uncertainty because the real data has not been generated by a CPG model. Both examples are based on the motorcycle insurance data swmotorcycle used in [11], this data is available through the R package CASdatasets [3], see Listing 1 for an excerpt of the data. For the synthetic data we sample a portfolio of covariates from the original data, and then generate claims with a CPG GLM designed according to the assumptions of Theorem 3.8. For the real data example we fully rely on the swmotorcycle data and we use the corresponding claim observations.
Description of motorcycle data
We briefly describe the data, for more information we refer to “Appendix B”, below. The data comprises comprehensive insurance for motorcycles which covers loss or damage of motorcycles other than collision, for instance, caused by theft, fire or vandalism. The data is aggregated on insurance policy level for years 1994–1998. The data is shown in Listing 1. We have applied some preprocessing, e.g., we have dropped all policies that have an exposure equal to zero.
We briefly describe the variables, the following enumeration refers to lines 2–10 of Listing 1:

2. Age: age of motorcycle owner in \(\{18,\ldots , 70\}\) years (we cap at 70 because of scarcity above);

3. Gender: gender of motorcycle owner either being Female or Male;

4. Zone: seven geographical Swedish zones being (1) central parts of Sweden’s three largest cities, (2) suburbs and middle-sized towns, (3) lesser towns except those in zones (5)–(7), (4) small towns and countryside except those in zones (5)–(7), (5) Northern towns, (6) Northern countryside, and (7) Gotland (Sweden’s largest island);

5. McClass: seven ordered motorcycle classes received from the so-called EV ratio defined as (Engine power in kW \(\times\) 100)/(Vehicle weight in kg \(+\) 75 kg);

6. McAge: age of motorcycle in \(\{0,\ldots , 30\}\) years (we cap at 30 because of scarcity beyond);

7. Bonus: ordered bonus-malus class from 1 to 7, entry level is 1;

8. Exposure: total exposure in yearly units in interval [0.0274, 31.3397], the shortest entry referring to 1 day and the longest one to more than 31 years;^{Footnote 1}

9. ClaimNb: number of claims N on the policy;

10. ClaimCosts: total claim costs \(S=\sum _{j=1}^N Z_j\) on the policy (thus, we do not have information about individual claims \(Z_j\) but only about sufficient statistics \({\bar{Z}}\) on each policy).
The data is illustrated in “Appendix B”.
Synthetic data example
This section is based on synthetic (simulated) data from a CPG GLM.
A generalized linear model approach
We start by describing the simulation of the synthetic data. We randomly choose \(n=250'000\) insurance policies from dat=swmotorcycle using the R code:
Based on this portfolio we generate claims (N, Y) using two GLMs that fulfill the CPG assumptions of Theorem 3.8, the modeling details are specified in columns 1–3 of Table 1. We especially emphasize that the covariate spaces \({{\mathcal {X}}}\) and \({{\mathcal {Z}}}\) differ for claim counts and claim sizes.
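The claim generation step can be sketched as follows for a single policy. This is an illustrative Python sketch using only the standard library, not the R code used for the study; the function names are hypothetical, and the parameter values are placeholders (the frequency is chosen far larger than in the motorcycle portfolio so that the check on the sample mean is stable). In the actual example, \(\lambda _i\) and \(\zeta _i\) would be obtained from the log-linear predictors specified in Table 1.

```python
import math
import random


def sample_poisson(rng, mean):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_policy(rng, lam, zeta, gamma, w):
    """Return (N, Y): claim count and total claim cost per unit of exposure."""
    n_claims = sample_poisson(rng, lam * w)
    # Gamma claim sizes with shape gamma and mean zeta (scale = zeta / gamma).
    total = sum(rng.gammavariate(gamma, zeta / gamma) for _ in range(n_claims))
    return n_claims, total / w


rng = random.Random(2021)
portfolio = [
    simulate_policy(rng, lam=0.5, zeta=2.0, gamma=1.5, w=1.0)
    for _ in range(20000)
]
# The sample mean of Y should be close to mu = lam * zeta = 1.0.
mean_y = sum(y for _, y in portfolio) / len(portfolio)
```

Each simulated pair corresponds to the data representation \((N_i, Y_i)\) with \(Y_i=N_i{\bar{Z}}_i/w_i\) used in both log-likelihoods.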
CPG GLM We estimate the Poisson claim counts GLM and the gamma claim amounts GLM separately, according to Sect. 3.1 and under log-link choices. The results are presented in column ‘estimated CPG’ of Table 1, the brackets provide one estimated standard deviation received from the inverse of Fisher’s information matrix. Note that we can estimate all regression parameters \(\beta _k\) and \(\alpha _k\) without specifying shape parameter \(\gamma >0\) explicitly. Most estimated parameters are within one standard deviation of the true parameter values. The true parameters have been chosen such that they resemble the true data swmotorcycle. The true data has an observed claim frequency of only 1.05%, see “Appendix B”. In the present example, claims are scarce too, and the gamma claim size GLM has been estimated on (only) 2’795 claims. The parameter estimates are remarkably accurate (we do not have model uncertainty here, only parameter estimation uncertainty). We conclude that this model can be calibrated well using the separate approach for claim counts and claim amounts.
Figure 1 (lhs) considers the log-likelihood function \(\gamma \mapsto \ell (\widehat{\varvec{\alpha }}, \gamma )\) of the gamma GLM to estimate shape parameter \(\gamma\), we also refer to score Eq. (3.9). From this we find MLE \({\widehat{\gamma }}=1.56\), and the inverse of Fisher’s information matrix provides an estimated standard deviation of 0.04 for this estimate. Thus, the estimated shape parameter is slightly too high, though still within two standard deviations of the true value of \(\gamma =1.5\). We again highlight that this estimate is based on only 2’795 claims. Moreover, we remark that \({\widehat{\gamma }}\) has been used in the standard deviation estimates of Table 1, see (3.8).
Tweedie’s DGLM Next we turn our attention to Tweedie’s CP case. The true values \(\varvec{\beta }\), \(\varvec{\alpha }\) and \(\gamma\) as well as their MLE counterparts \(\widehat{\varvec{\beta }}\), \(\widehat{\varvec{\alpha }}\) and \({\widehat{\gamma }}\) from the CPG model are transformed with Theorem 3.8 to receive the same model in Tweedie’s CP parametrization, this is illustrated in the first four columns of Table 2. In a first calibration step for Tweedie’s CP model, we choose \(p=1.39\) which is the optimal power variance parameter estimate of the CPG model, see last line in column 4 of Table 2. We then calibrate Tweedie’s CP DGLM model for this power variance parameter p with Fisher’s scoring method (3.25)–(3.26); as starting values for the algorithm we use the estimates from the CPG model (in italic in Table 2). Fisher’s scoring method converges in 7 iterations with these initial values. Due to (3.36) we receive a model that has a bigger log-likelihood than its CPG counterpart (we include all constants in this consideration so that the log-likelihoods are directly comparable).
In the next step, we optimize over the power variance parameter p. For this, we use Theorem 3.6, which says that the mean estimates \({\widehat{\mu }}_i\) do not depend on p, and which provides the p-scaling for dispersion parameter MLEs \({\widehat{\phi }}_i(p)\). This allows us to directly plot the profile log-likelihood \(p\mapsto \ell (\widehat{\varvec{\beta }}^*, \widehat{\varvec{\alpha }}^*(p),p)\) as a function of \(p \in (1.36, 1.41)\), see Fig. 1 (rhs). From this figure, we find maximizing value \({\widehat{p}}=1.39\), which is close to the true value of \(p=1.4\). The second last column in Table 2 shows the resulting MLEs \(\widehat{\varvec{\beta }}^*\) and \(\widehat{\varvec{\alpha }}^*({\widehat{p}})\) of the optimal Tweedie’s CP model. A first observation is that the parameter estimates from Tweedie’s CP model are not as close to the true values as the MLEs from the CPG model. However, model selection should not be based on this observation: note that the (true) CPG model has 22 parameters and Tweedie’s CP model has 33 parameters, therefore, we expect some differences in model calibration.
We summarize the two estimated models in Table 3. On row (a) we compare the log-likelihoods \(\ell _\mathrm{CPG}(\widehat{\varvec{\beta }},\widehat{\varvec{\alpha }}, {\widehat{p}})\) and \(\ell _\mathrm{Tw}(\widehat{\varvec{\beta }}^*,\widehat{\varvec{\alpha }}^*, {\widehat{p}})\) of the estimated models CPG and Tweedie’s CP, see also (3.36), to the one of the true model \(\ell ({\varvec{\beta }^*},{\varvec{\alpha }^*}, {p})\): we observe that both models slightly overfit to the data, with Tweedie’s CP model having a slightly larger overfit [this is consistent with (3.36)]. Therefore, we penalize in AIC the log-likelihoods of the models by the number of parameters involved, see (3.37). The AIC values are given on row (b) of Table 3, and we give preference to the CPG calibration. Performing a likelihood-ratio test having the CPG model as null hypothesis model nested in Tweedie’s CP model, gives a p-value of 34%, thus, we do not reject the null hypothesis on a 5% significance level. This gives support that we should go for the smaller CPG model in this example. Row (c) of Table 3 gives the root mean square error (RMSE) between the true model means \(w_i\mu _i\) and their estimated counterparts \(w_i{\widehat{\mu }}_i=w_i{\widehat{\lambda }}_i{\widehat{\zeta }}_i\); rows (d)–(g) show average means and dispersions as well as the corresponding standard deviations. We observe that these figures match the true values quite well. Recall that these figures are based on one simulation from the true model for each insurance policy, thus, they involve simulation error (but they do not involve model error because we only assume parameters \(\varvec{\beta }\), \(\varvec{\alpha }\) and p as unknown in this example). Moreover, we remark that the dispersion is not underestimated, here, we also refer to the last bullet point of Remarks 3.5.
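The likelihood-ratio test mentioned above can be sketched as follows, assuming the usual asymptotic \(\chi ^2\) approximation with degrees of freedom equal to the number of additional parameters (here \(33-22=11\)). The log-likelihood values are hypothetical placeholders and do not reproduce Table 3; the \(\chi ^2\) survival function is computed with a plain series expansion to keep the sketch free of external libraries.

```python
import math


def chi2_sf(x, df):
    """Survival function of the chi-square distribution.

    Uses the series of the regularized lower incomplete gamma function
    P(a, z) with a = df / 2 and z = x / 2, and returns 1 - P(a, z).
    """
    if x <= 0.0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


def likelihood_ratio_pvalue(loglik_null, loglik_alt, extra_params):
    """p-value of the test: null model nested in the alternative model."""
    statistic = 2.0 * (loglik_alt - loglik_null)
    return chi2_sf(statistic, extra_params)


# Hypothetical log-likelihoods: CPG (null) versus Tweedie's CP (alternative).
p_value = likelihood_ratio_pvalue(-1006.0, -1000.0, 33 - 22)
```

A p-value above the chosen significance level means that the additional parameters of the larger model do not improve the fit significantly, which supports keeping the smaller model.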
Finally, in Fig. 2 we plot the predicted means \({\widehat{\mu }}_i\) against the true values \(\mu _i\). The left-hand side compares the two estimated models against the true model, and the right-hand side compares the two estimated models against each other. From these plots we conclude that both models are very accurate, the CPG estimated one (orange) being slightly closer to the true model than its Tweedie’s CP counterpart (green). Summarizing: This synthetic example gives evidence supporting the industry practice of focusing on the CPG model. Specifying covariate spaces is easier in the CPG case because systematic effects of claim counts and claim amounts are clearly separated, and accuracy is slightly higher in our example because Tweedie’s CP model seems to overfit slightly.
A neural network regression approach
Next we explore neural network regression models on the same synthetic data. Neural networks have the capability of representation learning which means that they can perform covariate engineering themselves, we refer to Sections 4 and 5 of [21]. Therefore, covariates can be provided in their raw form to neural networks. The neural networks then, at the same time, preprocess these covariates and predict the response variables. Starting from a GLM, the required changes to achieve this representation learning are comparably small. We illustrate this in the present section. Alternatively, one may also be interested in using generalized additive models (GAMs). GAMs are more flexible in modeling different functional forms in the components of the covariates compared to GLMs, however, they do not automatically allow for flexible interaction modeling between covariate components. For this reason, we favor neural networks over GAMs.
We first define the (raw) covariate space \({{\mathcal {X}}}^\dagger\) which is going to be used throughout this section:
where we use dummy coding for the categorical variable \(\mathtt{Zone}\in \{0,1\}^4\). In contrast to Table 1, we do not specify the continuous variables in all their functional forms, but we let the neural network find these functional forms. A neural network is a function
that consists of a composition of a fixed number of hidden network layers, each of them having a certain number of hidden neurons. For an explicit mathematical definition we refer to Section 3.1 in [21]. \({\varvec{x}}^\dagger\) has the interpretation of being the raw covariate, and \({\varvec{x}}=\psi ({\varvec{x}}^\dagger ) \in {{\mathbb {R}}}^d\) can be interpreted as the (network) preprocessed covariate. These preprocessed covariates \(\psi ({\varvec{x}}^\dagger )\) are then used in a classical GLM, e.g., for claim counts we may set for the loglink choice, see (3.2),
note that we use a slight abuse of notation here because strictly speaking \(\psi ({\varvec{x}}^\dagger )\) does not include an intercept term for \(\beta _0\), so this always needs to be added. Neural network regression function (4.3) involves regression parameters \(\varvec{\beta } \in {{\mathbb {R}}}^{d+1}\) as well as network weights \(\vartheta \in {{\mathbb {R}}}^r\) which parametrize network function \(\psi =\psi _\vartheta\). The dimension r of \(\vartheta\) depends on the complexity of the chosen network \(\psi\). Network fitting now trains at the same time network parameter \(\vartheta\) for an optimal covariate preprocessing as well as GLM parameter \(\varvec{\beta }\) for response prediction. State-of-the-art fitting uses variants of the gradient descent algorithm, and a good performance depends on the complexity of \(\psi\), we just mention the universal approximation property of appropriately designed neural networks. For more information, we refer to the relevant literature, in particular, to [21]. Based on this reference we explore (4.3) and its counterparts for claim counts and Tweedie’s CP model. In all three prediction problems we use the identical covariate space \({{\mathcal {X}}}^\dagger\), and only network function \(\psi\) will differ in the weights \(\vartheta\) to bring covariates into the appropriate form for the corresponding prediction task.
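The structure just described, a network \(\psi\) that preprocesses the raw covariates followed by a GLM step with log-link and explicit intercept \(\beta _0\), can be sketched as a plain forward pass. The hyperbolic tangent activation, the layer sizes and all numerical values are assumptions of this sketch and are not taken from the architecture of Listing 4 in [14]; the function names are hypothetical.

```python
import math


def hidden_layer(x, weights, biases):
    """One fully connected layer with tanh activation (assumed activation)."""
    return [
        math.tanh(b + sum(w_j * x_j for w_j, x_j in zip(row, x)))
        for row, b in zip(weights, biases)
    ]


def network_frequency(x_raw, layers, beta0, beta):
    """lambda(x_raw) = exp(beta0 + <beta, psi(x_raw)>) under the log-link."""
    z = x_raw
    for weights, biases in layers:
        # Each layer maps the current representation to the next one.
        z = hidden_layer(z, weights, biases)
    # z now plays the role of the preprocessed covariate psi(x_raw).
    return math.exp(beta0 + sum(b_k * z_k for b_k, z_k in zip(beta, z)))


# Hypothetical tiny example: 8 raw covariates, one hidden layer of 3 neurons.
x_raw = [0.2, -1.0, 0.0, 1.0, 0.0, 0.5, 0.3, -0.4]
layers = [([[0.1] * 8, [-0.2] * 8, [0.05] * 8], [0.0, 0.1, -0.1])]
lam = network_frequency(x_raw, layers, beta0=math.log(0.01), beta=[0.3, -0.2, 0.1])
```

Setting all entries of \(\varvec{\beta }\) except the intercept to zero recovers the intercept model \(\lambda =\exp (\beta _0)\), whatever the network weights are.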
Poisson claim counts We start by modeling claim counts using neural network approach (4.3). We use the R library keras to implement this, and we use exactly the same architecture as in Listing 4 of [14], the only thing that changes is the dimension of \({{\mathcal {X}}}^\dagger\) from 40 on line 1 of Listing 4 in [14] to 8 in the present example, see (4.1). This results in \(r=655\) and \(d+1=11\) parameters. We fit these parameters in the usual way by considering 80% of the data for training and 20% of the data for out-of-sample validation to track overfitting in the gradient descent algorithm (we run 100 epochs on batch size 5000). We then choose the parameter that has the best out-of-sample performance on the validation data. To this network solution we apply the bias regularization step of Listing 5 in [21] to make the model unbiased.
On rows (a1)–(a2) of Table 4 we present the results for the claim counts neural network model. We provide the Poisson deviance losses of the true model \(\lambda _i\) (which is known here because we simulate from this model), the intercept model that does not use covariate information (i.e. is only based on intercept parameter \(\beta _0\)), the claim counts GLM (upper part of Table 1) and its neural network counterpart. We observe that both regression models slightly overfit to the data, with losses of \(8.4366\cdot 10^{-2}\) and \(8.4393\cdot 10^{-2}\), respectively, compared to the true model loss of \(8.4431\cdot 10^{-2}\).
On row (a2) we provide the RMSE between the true model means \(\lambda _i\) and the estimated ones \({\widehat{\lambda }}_i\). We note that the Poisson GLM has a smaller RMSE than the neural network Poisson regression model. This is not surprising because the Poisson GLM uses the right functional form (no model uncertainty) and only estimates regression parameter \(\varvec{\beta }\) whereas the neural network regression model also determines this functional form for the raw covariates \({\varvec{x}}^\dagger\). In Fig. 3 (lhs) we compare the resulting estimated frequencies to the true ones on all individual insurance policies \(i=1,\ldots , n\). From this plot we conclude that both models do a fairly good job because the dots lie more or less on the diagonal (which reflects the perfect model).
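For reference, the Poisson deviance loss used to compare the claim count models can be sketched as below. The normalization (an average over policies) is an assumption of this sketch; the exact scaling used in Table 4 is not stated in this section, and the function name is hypothetical.

```python
import math


def poisson_deviance_loss(counts, exposures, frequencies):
    """Average Poisson deviance over all policies.

    Per policy: 2 * [mu - N + N * log(N / mu)] with mu = w * lambda,
    where the logarithmic term is set to zero for N = 0.
    """
    total = 0.0
    for n_i, w_i, lam_i in zip(counts, exposures, frequencies):
        mu_i = w_i * lam_i
        total += 2.0 * (mu_i - n_i)
        if n_i > 0:
            total += 2.0 * n_i * math.log(n_i / mu_i)
    return total / len(counts)
```

Because most policies have no claim, the loss is dominated by the terms \(2w_i\lambda _i\), which explains why values of the order of a few percent are typical at claim frequencies around 1%.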
Gamma claim sizes Next we consider a neural network approach for the gamma claim sizes. This essentially means that we replace linear predictor (3.5) by the following neural network predictor (under a loglink choice for \(g_2\))
where \(\psi\) is a neural network function (4.2) that may have the same structure as the one used for the Poisson regression model (4.3), but typically differs in network weights \(\vartheta\). For simplicity, we use exactly the same neural network architecture as in the Poisson case, only the exposure offset is dropped and the Poisson deviance loss function is changed to the gamma deviance loss function (including weights), in line with the distributional assumptions made.
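A weighted gamma deviance loss of the kind referred to above can be sketched as follows. The sketch assumes that the weights are the claim counts \(N_i\) of the policies with claims and that the loss is normalized by the total weight; both choices, as well as the function name, are assumptions and not taken from the implementation used for Table 4.

```python
import math


def gamma_deviance_loss(avg_sizes, n_claims, predicted_means):
    """Claim-count-weighted average gamma deviance.

    Per policy with claims: 2 * N * [(z - zeta) / zeta - log(z / zeta)],
    where z is the observed average claim size and zeta its prediction.
    """
    total, weight = 0.0, 0.0
    for z_i, n_i, zeta_i in zip(avg_sizes, n_claims, predicted_means):
        total += 2.0 * n_i * ((z_i - zeta_i) / zeta_i - math.log(z_i / zeta_i))
        weight += n_i
    return total / weight
```

Only policies with at least one claim enter this loss, which is why it is evaluated on the comparatively small set of observed claims.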
The results are presented on rows (b1)–(b2) of Table 4 and Fig. 3 (rhs) (we run 1000 epochs on batch size 5000 and we callback the model with the smallest validation loss). Again we receive reasonably good results from the network approach, i.e., covariate engineering on \({{\mathcal {X}}}^\dagger\) is done quite well by the network, we emphasize that these results are based on only 2’795 claims. But we also see from Fig. 3 (rhs) that individual predictions spread more around the diagonal than in the gamma GLM case (where we assume perfect knowledge about the functional form of the regression function). Better accuracy can only be achieved by having more claim observations.
Next, we estimate the shape parameter \(\gamma\). This is done analogously to the gamma GLM case by plotting the corresponding log-likelihood \(\ell (\gamma )\) as a function of \(\gamma\). This gives estimate \({\widehat{\gamma }}=1.57\), which is slightly too large but still reasonable compared to the true value of \(\gamma =1.5\). A too high shape parameter implies a too low dispersion, which is a sign of overfitting to the observations.
We conclude with the summary statistics for the neural network approaches in Table 5 column ‘estimated CPG’, which look fairly similar to the GLM ones in Table 3. We obtain a larger RMSE, which is not surprising because we have more model uncertainty due to missing covariate knowledge, this is also obvious from Fig. 3.
Tweedie’s compound Poisson neural network approach. First, we remark that, in general, there is no simple comparison between a CPG and a Tweedie CP neural network approach similar to (3.34)–(3.35). The relation (3.34)–(3.35) relies strongly on the fact that we can directly compare linear predictors under suitable choices of covariate spaces. Since the networks given by (4.2) transform covariates in a non-trivial way under non-linear activation functions, there is no hope of an easy comparison between the models unless the network architectures are chosen in a very specific, i.e. artificial, way. Therefore, we do not aim to nest the CPG neural network into Tweedie’s CP neural network model, but we directly focus on modeling the latter. This essentially implies that we have to replace linear predictors (3.21)–(3.22) by the following two-dimensional neural network predictors (under log-link choices for \(g_p\) and \(g_d\))
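Taking both links to be logarithms, a sketch of this predictor (the component notation \(\psi _1,\psi _2\) is an assumption introduced here) is

\[ {\varvec{x}}^\dagger \mapsto \left( \log \mu ({\varvec{x}}^\dagger ), \log \phi ({\varvec{x}}^\dagger )\right) = \left( \psi _1({\varvec{x}}^\dagger ), \psi _2({\varvec{x}}^\dagger )\right) = \psi ({\varvec{x}}^\dagger ) \in {{\mathbb {R}}}^2, \]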
where \(\psi\) is a neural network function (4.2). The first component of \((\mu ,\phi )({\varvec{x}}^\dagger ) \in {{\mathbb {R}}}^2_+\) predicts the total claim costs Y and the second component estimates the dispersion parameter \(\phi\). We use one network \(\psi\) to simultaneously perform this prediction task for mean and dispersion parameter. We implement this in the R library keras and we use the same architecture as in Listing 4 of [14], but we need to change the input dimension to 8 and the output dimension to 2. The exposures \(w_i\) are treated as weights as follows
This requires a custom-made loss function in keras for parameter estimation; the details are provided in Listing 2 in the “Appendix”. We fit this model with the gradient descent algorithm using exactly the same methodology as outlined above (callback of the lowest validation loss model after 100 epochs on batch sizes 5000).
In order to come up with the optimal neural network model we need to fit neural networks for multiple power variance parameters p, because there is no result similar to Theorem 3.6 that allows for a shortcut. Of course, this disadvantages Tweedie’s CP neural network model from a computational point of view. We come up with an optimal power variance parameter estimate of \({\widehat{p}}=1.390\), which then yields the results in the last column of Table 5. From the figures on rows (c)–(g) we conclude that Tweedie’s CP approach is not fully competitive with the CPG fitting. These differences are also illustrated in Fig. 4, with the CPG approach being slightly closer to the true model means. Nevertheless, all these estimates look very reasonable and the estimated neural network seems to capture the crucial features of the true model.
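As a building block for such a loss, the unit deviance of Tweedie’s CP model for \(1<p<2\) can be sketched in Python as follows. This is not Listing 2: the function names are hypothetical, and the scaled deviance alone is not the full negative log-likelihood once \(\phi\) is estimated as well, because the normalizing term of the density also depends on \(\phi\) and p.

```python
def tweedie_unit_deviance(y, mu, p):
    """Unit deviance of Tweedie's CP model for 1 < p < 2, y >= 0, mu > 0."""
    if not 1.0 < p < 2.0:
        raise ValueError("p must lie strictly between 1 and 2")
    return 2.0 * (
        y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
        - y * mu ** (1.0 - p) / (1.0 - p)
        + mu ** (2.0 - p) / (2.0 - p)
    )


def scaled_tweedie_deviance(ys, mus, phis, weights, p):
    """Sum of scaled unit deviances w_i * d_p(y_i, mu_i) / phi_i."""
    return sum(
        w * tweedie_unit_deviance(y, mu, p) / phi
        for y, mu, phi, w in zip(ys, mus, phis, weights)
    )
```

The unit deviance vanishes for \(y=\mu\) and is well defined for \(y=0\), which matters because most policies have no claims.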
Conclusions from our synthetic data example. Our findings support the industry practice of focusing on the CPG parametrization. Our estimated models based on this parametrization are closer to the true model than the ones obtained from Tweedie’s CP parametrization. If we work under GLM assumptions, we need to preprocess covariates, which is easier in the CPG parametrization because systematic effects of claim counts and claim amounts can be separated. If we work under neural network regression models, model calibration is not efficient under Tweedie’s CP parametrization because we need to run gradient descent algorithms for multiple power variance parameters p to find the optimal model. Moreover, in our example, the CPG case leads to more accurate predictive models.
Real data example: an outlook
In view of the previous example everything seems to be fairly clear. However, our synthetic data is based on the very strong property of having gamma claim sizes with constant shape parameter \(\gamma\) over the whole insurance portfolio. This assumption may be critical in real insurance applications. We briefly analyze it in terms of our real data example given in “Appendix B”, and we give an outlook in case this assumption is not fulfilled. We keep this section very short, and we mainly view it as a motivation to conduct future research in this direction.
There are two ways in which the constant shape parameter assumption may fail: either the claim sizes are gamma distributed, but the shape parameter \(\gamma _i\) depends on the insurance policy i, or the gamma distribution is inappropriate because the claim sizes are too heavy-tailed. We explore this on the real data example provided in “Appendix B”. For this it suffices to focus on the gamma claim size model, i.e. we do not study claim counts in this real data example. Moreover, to minimize covariate preprocessing we explore a gamma neural network regression model on these claim sizes. The chosen model architecture is identical to the one used in (4.4); in particular, it does covariate engineering itself.
Table 6 shows the (in-sample) gamma deviance losses of the intercept model and the neural network regression model. Obviously, the neural network approach has a better performance (note that the network model has been obtained by a proper training-validation analysis as described above). Using the resulting mean estimates \({\widehat{\zeta }}_i\) we can estimate the (constant) shape parameter \(\gamma\). This is illustrated in Fig. 5 (lhs): we estimate \({\widehat{\gamma }}=0.75\). Thus, we obtain a shape parameter smaller than 1, which provides over-dispersion \(1/{\widehat{\gamma }}=1.33>1\), i.e., the estimated gamma densities are strictly decreasing. This fact requires further examination because there are two possible situations: either the true shape parameter is smaller than 1 (and everything is fine), or the claim sizes are more heavy-tailed than a gamma distribution allows. The latter is typically compensated by over-dispersion in the estimated model. We analyze this warning signal on our real data.
Figure 5 (rhs) gives the Tukey–Anscombe plot of the gamma deviance residuals against the fitted means. This plot supports the model choice because we cannot see any particular structure in the figure; it also supports the constant shape parameter assumption on \(\gamma\). Figure 6 gives the QQ-plot and compares the observed claims against one simulation from the fitted model. These two plots also look quite reasonable; one may only question the upper tail of the QQ-plot.
Conclusions. The short analysis on the real data has shown that for the motorcycle claims data the gamma claim size model is fairly reasonable, thus supporting the CPG model. On different data, one may relax the constant shape parameter assumption on \(\gamma\). This may result in a DGLM for gamma claim sizes (which is known in industry) and a Poisson GLM for claim counts. Again, this model can easily be fitted in the Poisson-gamma parametrization. However, this approach does not have a Tweedie’s CP counterpart relying on a fixed parameter p, which gives more support to the industry preference for the Poisson-gamma parametrization.
Conclusion
We have revisited the compound Poisson model with i.i.d. gamma claim sizes. This model allows for two different parametrizations, namely, the Poisson-gamma parametrization and Tweedie’s compound Poisson parametrization. We have provided results for GLMs illustrating when the two parametrizations are identical, and we have provided a theorem that allows for efficient fitting of power variance parameters in Tweedie’s parametrization (under log-link choices for the GLMs).
In the applied section, we have analyzed why the insurance industry gives preference to the Poisson-gamma parametrization. Based on examples, we find that, indeed, this parametrization is easier to fit, and results turn out to be more accurate in our examples. In particular, under neural network regression models we give a clear preference to the Poisson-gamma parametrization because Tweedie’s version does not possess an easy and efficient way of estimating the power variance parameter. That is, the Tweedie version is computationally clearly lagging behind the Poisson-gamma case.
For our real data example it turns out that the gamma claim size model with constant shape parameter is quite reasonable. However, in many other applications this is not the case. Therefore, the insurance industry explores double GLMs for a flexible modeling of shape parameters of claim sizes; on the other hand, a case-dependent modeling of p in Tweedie’s compound Poisson parametrization is not (easily) feasible. For modeling more heavy-tailed claim sizes, mixture models are a promising proposal.
Notes
For a rigorous pricing exercise one should truncate longer exposures, say, to one accounting year; otherwise one implicitly introduces a survival bias on policies with longer exposures, assuming that people are more likely to give up motorcycling after a claim.
References
[1] Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
[2] Barndorff-Nielsen O (2014) Information and exponential families. In: Statistical theory. John Wiley & Sons, Chichester, UK
[3] Dutang C, Charpentier A (2019) CASdatasets \({\mathtt{R}}\) package vignette. Reference manual, November 13, 2019. Version 1.0-10
[4] Jørgensen B (1986) Some properties of exponential dispersion models. Scand J Stat 13(3):187–197
[5] Jørgensen B (1987) Exponential dispersion models. J R Stat Soc Ser B (Methodol) 49/2:127–145
[6] Jørgensen B, de Souza MCP (1994) Fitting Tweedie’s compound Poisson model to insurance claims data. Scand Actuar J 1994(1):69–93
[7] McCullagh P, Nelder JA (1983) Generalized linear models. Chapman & Hall, London
[8] Nelder JA, Pregibon D (1987) An extended quasi-likelihood function. Biometrika 74:221–231
[9] Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc Ser A (Gen) 135/3:370–384
[10] NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/ Release 1.0.28 of 2020-09-15. In: Olver FWJ, Olde Daalhuis AB, Lozier DW, Schneider BI, Boisvert RF, Clark CW, Miller BR, Saunders BV, Cohl HS, McClain MA (eds)
[11] Ohlsson E, Johansson B (2010) Non-life insurance pricing with generalized linear models. Springer, Berlin
[12] Quijano Xacur OA, Garrido J (2015) Generalised linear models for aggregate claims: to Tweedie or not? Eur Actuar J 5(1):181–202
[13] R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
[14] Schelldorfer J, Wüthrich MV (2019) Nesting classical actuarial models into neural networks. SSRN Manuscript ID 3320525. Version of January 22, 2019
[15] Smyth GK (1989) Generalized linear models with varying dispersion. J R Stat Soc Ser B (Methodol) 51:47–60
[16] Smyth GK (1996) Partitioned algorithms for maximum likelihood and other nonlinear estimation. Stat Comput 6:201–216
[17] Smyth GK, Jørgensen B (2002) Fitting Tweedie’s compound Poisson model to insurance claims data: dispersion modeling. ASTIN Bull 32(1):143–157
[18] Smyth GK, Verbyla AP (1999) Adjusted likelihood methods for modelling dispersion in generalized linear models. Environmetrics 10:696–709
[19] Tweedie MCK (1984) An index which distinguishes between some important exponential families. In: Ghosh JK, Roy J (eds) Statistics: applications and new directions. Proceeding of the Indian statistical golden jubilee international conference. Indian Statistical Institute, Calcutta, pp 579–604
[20] Wüthrich MV (2013) Non-life insurance: mathematics & statistics. SSRN Manuscript ID 2319328. Version of January 7, 2020
[21] Wüthrich MV (2019) From generalized linear models to neural networks, and back. SSRN Manuscript ID 3491790. Version of April 3, 2020
[22] Wüthrich MV (2020) Bias regularization in neural network models for general insurance pricing. Eur Actuar J 10(1):179–202
Funding
Open Access funding provided by ETH Zurich.
Appendix
A Generalized linear models
GLMs have been introduced in [9], and they have been studied in the monograph [7]. GLMs are based on the EDF (2.2). The EDF has been studied extensively in [2, 4, 5], and its properties have been revisited in [21]. The original introduction of EDF distributions (2.2) is constructive from which it follows that the effective domain \(\varvec{\Theta }\) is a convex set and that the cumulant function \(\kappa\) is a smooth and convex function on the interior of the effective domain \(\mathring{\varvec{\Theta }}\). Moreover, we get the following moments for Y having EDF distribution (2.2)
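Writing the EDF density (2.2) in the usual form \(f(y;\theta ,\phi /w)=\exp \{(y\theta -\kappa (\theta ))/(\phi /w)+a(y;\phi /w)\}\) (an assumption about the notation of (2.2), with \(a\) denoting the normalizing term), the moment generating function, the mean and the variance are

\[ {{\mathbb {E}}}\left[ e^{rY}\right] = \exp \left\{ \frac{w}{\phi }\left( \kappa \left( \theta + r\phi /w\right) - \kappa (\theta )\right) \right\} , \qquad \mu ={{\mathbb {E}}}[Y]=\kappa '(\theta ), \qquad \mathrm{Var}(Y)=\frac{\phi }{w}\,\kappa ''(\theta ), \]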
for r sufficiently small such that \(\theta +r\phi /w \in \mathring{\varvec{\Theta }}\) for \(\theta \in \mathring{\varvec{\Theta }}\). Convexity of \(\kappa\) implies existence of the canonical link providing canonical parameter and variance function, respectively,
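In standard notation (the symbol \(h\) for the canonical link is an assumption introduced here), these two objects are

\[ \theta = h(\mu ) = (\kappa ')^{-1}(\mu ) \qquad \text {and}\qquad V(\mu ) = \left( \kappa ''\circ h\right) (\mu ). \]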
GLMs are based on a linear predictor \(\eta\) for modeling the mean parameter \(\mu ={{\mathbb {E}}}[Y]\). Assume we have \((d+1)\)-dimensional covariates \({\varvec{x}}\in {{\mathcal {X}}}=\{1\} \times {{\mathbb {R}}}^d\). The linear predictor \(\eta =\eta ({\varvec{x}})\) is obtained by choosing a suitable link function \(g(\cdot )\) such that the following relationship holds
for a given regression parameter \(\varvec{\beta }\in {{\mathbb {R}}}^{d+1}\). To have a well-defined GLM we need to ensure that
This might be a challenge for (one-sided) bounded effective domains \(\varvec{\Theta }\) and may require a careful choice of the link function \(g(\cdot )\).
Assume we have n independent pairs of random variables and covariates \((Y_i,{\varvec{x}}_i)\) following an EDF distribution (2.2) with the same cumulant function \(\kappa\); we choose the same link function \(g(\cdot )\) to obtain linear predictors \(\eta _i = \left\langle \varvec{\beta }, {\varvec{x}}_i \right\rangle\). The log-likelihood function of this model is
with canonical parameter \(\theta _i=(\kappa ')^{-1}(\mu _i)=\left( (\kappa ')^{-1} \circ g^{-1}\right) (\eta _i)\). The score w.r.t. \(\varvec{\beta }\) is obtained by the gradient
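Written out with the intercept component \(x_0=1\) (the component indexing is an assumption), this relationship reads

\[ g(\mu ) = \eta ({\varvec{x}}) = \left\langle \varvec{\beta }, {\varvec{x}} \right\rangle = \beta _0 + \sum _{j=1}^d \beta _j x_j, \]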
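A natural reading of this requirement (a reconstruction from the definitions above, not necessarily the exact original condition) is that the implied canonical parameter stays in the interior of the effective domain,

\[ (\kappa ')^{-1}\left( g^{-1}\left( \left\langle \varvec{\beta }, {\varvec{x}} \right\rangle \right) \right) \in \mathring{\varvec{\Theta }} \qquad \text {for all } {\varvec{x}}\in {{\mathcal {X}}}. \]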
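Assuming a common dispersion parameter \(\phi\), exposures \(w_i\) and the normalizing term \(a(\cdot ;\cdot )\) of the EDF density (the latter notation is an assumption), it takes the form

\[ \ell _{{\varvec{Y}}}(\varvec{\beta }) = \sum _{i=1}^n \left[ \frac{Y_i\theta _i - \kappa (\theta _i)}{\phi /w_i} + a\left( Y_i; \phi /w_i\right) \right] , \]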
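Applying the chain rule with \(\partial \theta _i/\partial \mu _i = 1/V(\mu _i)\) and \(\partial \mu _i/\partial \eta _i = 1/g'(\mu _i)\), a worked form of this gradient (under the same notational assumptions as for the log-likelihood) is

\[ \nabla _{\varvec{\beta }}\, \ell _{{\varvec{Y}}}(\varvec{\beta }) = \sum _{i=1}^n \frac{w_i}{\phi }\, \frac{Y_i-\mu _i}{V(\mu _i)}\, \frac{1}{g'(\mu _i)}\, {\varvec{x}}_i. \]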
We define the diagonal working weight matrix W and working residual vector \({\varvec{R}}\) by
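A choice consistent with the score of an EDF model and with the identity \({{\mathcal {I}}}(\varvec{\beta })={\mathfrak {X}}'W{\mathfrak {X}}\) (the exact original normalization is an assumption) is

\[ W = \mathrm{diag}\left( \frac{w_i}{\phi }\, \frac{1}{V(\mu _i)\, g'(\mu _i)^2}\right) _{1\le i \le n} \qquad \text {and}\qquad {\varvec{R}} = \left( g'(\mu _i)\, (Y_i-\mu _i)\right) _{1\le i \le n}. \]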
This allows us to write the score equation for finding the MLE of regression parameter \(\varvec{\beta }\) by
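In matrix form, with the working weights and working residuals just defined, the score equation reads

\[ {\mathfrak {X}}'W{\varvec{R}} = {\varvec{0}}, \qquad \text {(A.3)} \]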
with design matrix \({\mathfrak {X}}=({\varvec{x}}_1,\ldots , {\varvec{x}}_n)'\in {{\mathbb {R}}}^{n \times (d+1)}\). MLE system (A.3) is solved either using Fisher’s scoring method or the iteratively reweighted least squares (IRLS) algorithm, see [7, 9]. For Fisher’s scoring method we explore the scoring updates
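In its standard form (reconstructed from the usual Fisher scoring recursion and the surrounding definitions), this update is

\[ \varvec{\beta }_{t+1} = \varvec{\beta }_t + \left( {\mathfrak {X}}'W{\mathfrak {X}}\right) ^{-1} {\mathfrak {X}}'W{\varvec{R}}, \qquad \text {(A.4)} \]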
where all terms on the right-hand side are evaluated at algorithmic time t. It has been pointed out by an anonymous referee that the R command glm() does not directly calculate the inverse of the matrix \({\mathfrak {X}}'W{\mathfrak {X}}\) in (A.4), but, instead, solves a linear system for \(\varvec{\beta }_{t+1}\). The motivation for this approach is that in high-dimensional covariate spaces or in the situation of multiple categorical variables with many labels (implemented by dummy coding), the matrix \({\mathfrak {X}}'W{\mathfrak {X}}\) may be close to singular and, hence, inversion of this matrix may lead to unstable results.
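The idea of solving a linear system instead of inverting the matrix can be illustrated with a small Fisher scoring routine for a Poisson GLM with log-link and exposure offsets. This is a minimal Python sketch and not the algorithm implemented in glm(); all function names are hypothetical, and the first covariate component is assumed to be the intercept.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def poisson_fisher_scoring(xs, ys, exposures, n_iter=25):
    """Fisher scoring for counts y_i with mean w_i * exp(<beta, x_i>)."""
    d = len(xs[0])
    beta = [0.0] * d
    beta[0] = math.log(sum(ys) / sum(exposures))  # start at overall frequency
    for _ in range(n_iter):
        info = [[0.0] * d for _ in range(d)]  # Fisher information X'WX
        score = [0.0] * d                     # score vector X'WR
        for x, y, w in zip(xs, ys, exposures):
            mu = w * math.exp(sum(b * xj for b, xj in zip(beta, x)))
            for j in range(d):
                score[j] += (y - mu) * x[j]
                for k in range(d):
                    info[j][k] += mu * x[j] * x[k]
        # solve (X'WX) step = X'WR rather than forming the inverse
        step = solve_linear_system(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta
```

Production code would typically use a QR or Cholesky factorization for the solve step; plain Gaussian elimination is used here only to keep the sketch self-contained.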
Standard errors are obtained from the inverse of Fisher’s information matrix
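Under the working weights \(W\) used above (and writing \(\ell _{{\varvec{Y}}}\) for the log-likelihood, which is an assumed notation), it is given by

\[ {{\mathcal {I}}}(\varvec{\beta }) = -{{\mathbb {E}}}\left[ \nabla ^2_{\varvec{\beta }}\, \ell _{{\varvec{Y}}}(\varvec{\beta })\right] = {\mathfrak {X}}'W{\mathfrak {X}}, \]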
where \(\nabla ^2_{\varvec{\beta }}\) denotes the Hessian w.r.t. \({\varvec{\beta }}\). The IRLS algorithm replaces the inverse Fisher’s information matrix \({{\mathcal {I}}}(\varvec{\beta })^{-1} =({\mathfrak {X}}'W{\mathfrak {X}})^{-1}\) in the scoring updates by the inverse of the observed information matrix
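In the notation used here, the observed information is the negative Hessian of the log-likelihood (the symbol \(\ell _{{\varvec{Y}}}\) for the log-likelihood is an assumption),

\[ -\nabla ^2_{\varvec{\beta }}\, \ell _{{\varvec{Y}}}(\varvec{\beta }). \]

For the canonical link this Hessian does not depend on the observations \(Y_i\), so that the observed and the expected (Fisher) information coincide and the two algorithms produce the same updates.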
B Motorcycle data example
We start with a descriptive and exploratory analysis of the Swedish motorcycle data of Listing 1. We have \(n=62'036\) insurance policies with positive exposures \(w_i>0\). The empirical claim frequency is \({\bar{\lambda }}=\sum _{i=1}^n N_i/\sum _{i=1}^n w_i=1.05\%\), and the average claim size is \({\bar{\zeta }}=\sum _{i=1}^{n} \sum _{j=1}^{N_i}Z_{i,j}/\sum _{i=1}^{n}N_i = 24'641\) Swedish crowns SEK.
Figure 7 shows a boxplot over all exposures \(w_i\) and the claim counts \(N_i\) on all insurance policies. We note that insurance claims are rare events for this product, because the claim frequency is only \({\bar{\lambda }}= 1.05\%\).
Figures 8 and 9 give the marginal total exposures (split by gender), the marginal claim frequencies and the marginal average claim amounts for the covariate components Age, Zone, McClass, McAge and Bonus. The first observation is that we have a very imbalanced portfolio between genders: only 11% of the total exposure comes from females. The empirical claim frequency of females is 0.86% and the one of males is 1.08%. We note that the female claim frequency is based on (only) 61 claims (on a female exposure of 7’094 accounting years, versus 57’679 for males). Therefore, it is difficult to analyze females separately, and all marginal claim frequencies and claim sizes in Figs. 8 and 9 (middle and rhs) are analyzed jointly for both genders. Average claim sizes are 18’237 SEK and 25’270 SEK for females and males, respectively.
The empirical marginal frequencies in Figs. 8 and 9 (middle) are complemented with confidence bounds of two standard deviations (blue dotted lines) and the empirical overall frequency \({\bar{\lambda }}= 1.05\%\) (orange color). From the plots we conclude that we should keep the explanatory variables Age, Zone, McClass and McAge, but the variable Bonus does not seem to have any predictive power. At first sight, this seems surprising because the bonus-malus level encodes the past claims history. The reason that the bonus-malus level is not needed for our claims is that we consider comprehensive insurance for motorcycles covering loss or damage of motorcycles other than collision (for instance, caused by theft, fire or vandalism), whereas the bonus-malus level encodes collision claims. The situation for average claim amounts is a bit more difficult to understand, but we draw a similar conclusion, namely, that we can drop the covariate Bonus. Moreover, we merge Zones 5–7 because of small exposures and similar behavior.
Figure 10 shows the correlations between the covariates: (lhs) correlations between continuous covariates, (rhs) dependence between the continuous covariates and the categorical Zone covariate. We observe some dependence; for instance, in Zone 1 (three largest Swedish cities) motorcycles are lighter (McClass) and newer. Older people drive lighter motorcycles that are older, and older motorcycles are lighter.
Figure 11 gives the empirical density, empirical distribution and log-log plot of average claim amounts \({\bar{Z}}_i\). From the log-log plot we conclude that the average claim amounts are not heavy-tailed, which does not reject the use of gamma claim size distributions at this stage.
C R code
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Delong, Ł., Lindholm, M. & Wüthrich, M.V. Making Tweedie’s compound Poisson model more accessible. Eur. Actuar. J. 11, 185–226 (2021). https://doi.org/10.1007/s13385-021-00264-3
DOI: https://doi.org/10.1007/s13385-021-00264-3
Keywords
 Compound Poisson model
 Gamma claim sizes
 Tweedie’s distribution
 Exponential dispersion family
 Regression model
 Generalized linear model
 Neural network