Making Tweedie’s compound Poisson model more accessible

The most commonly used regression model in general insurance pricing is the compound Poisson model with gamma claim sizes. There are two different parametrizations for this model: the Poisson-gamma parametrization and Tweedie’s compound Poisson parametrization. Insurance industry typically prefers the Poisson-gamma parametrization. We review both parametrizations, provide new results that help to lower computational costs for Tweedie’s compound Poisson parameter estimation within generalized linear models, and we provide evidence supporting the industry preference for the Poisson-gamma parametrization.


Introduction
The most commonly used regression model in general insurance pricing is the compound Poisson model with gamma claim sizes. State-of-the-art industry practice fits two separate generalized linear models (GLMs) to the two parts of this model, namely, a Poisson GLM to claim counts and a gamma GLM to claim amounts. Both the Poisson and the gamma distributions belong to the exponential dispersion family (EDF). It has been noted by Tweedie [19] that the compound Poisson model with i.i.d. gamma claim sizes itself belongs to the EDF and, in fact, it closes the interval of power variance functions between the Poisson model and the gamma model, see Section 3 in Jørgensen [5]. As a result of Tweedie's and Jørgensen's findings we obtain two different parametrizations of the compound Poisson model with i.i.d. gamma claim sizes. Selection between these two different parametrizations has been explored in the work of Jørgensen-de Souza [6] in the context of GLM insurance pricing. Interestingly, to predict total claim amounts we need to fit two GLMs in the compound Poisson-gamma parametrization, whereas one GLM is sufficient to get the corresponding predictions within Tweedie's EDF parametrization. This indicates that in GLM applications these two parametrizations are not fully consistent. This point has been raised by Smyth-Jørgensen [17] who propose to use a double generalized linear model (DGLM) in Tweedie's parametrization to simultaneously model mean and dispersion parameters within the EDF.
The main purpose of this article is to revisit the work of Smyth-Jørgensen [17], and to give properties under which the two GLMs for claim counts and claim sizes and the DGLM with Tweedie's EDF parametrization lead to the same predictive model; this involves a discussion about choices of covariate spaces and GLM link functions. Based on this, our first main contribution provides a new result for Tweedie's DGLM that substantially reduces computational costs in calibrations of power variance parameters.
The second point that we explore is whether the insurance industry's preference of using the Poisson-gamma parametrization can be justified. A priori it is not clear whether either of the two ways lead to better predictive models. This part of our work is based on GLMs and on their neural network extensions. We receive evidence that supports the industry preference, in particular, under the choice of neural network regression models the Poisson-gamma parametrization is simpler in calibration and leads to more robust results.
We close this introduction with a number of remarks. First, we mention the recent survey paper of Quijano Xacur-Garrido [12], which has similar goals to the present paper. This survey only considers the single GLM case of Tweedie's parametrization, similar to Jørgensen-de Souza [6]. We emphasize that the full picture can only be obtained by comparing the Poisson-gamma parametrization to the DGLM case introduced in Smyth-Jørgensen [17]. Therefore, we revisit and extend this latter reference to receive a comprehensive comparison. Our view is supported by examples. These examples provide a proof of concept for situations with claims that are not too heavy tailed. However, these examples also highlight the weaknesses of this model on real insurance data, which often exhibits heavier tails than what is suitable under a gamma assumption. We remark that in our discussion we use the terminology of general insurance pricing, however, as commonly the case in general insurance, all our findings can be translated one-to-one to claims reserving problems.
Organization of the paper In Sect. 2 we introduce the compound Poisson model with i.i.d. gamma claim sizes and we derive its corresponding Tweedie parametrization. In Sect. 3 we embed both approaches into a GLM framework. We present the two GLMs needed for the Poisson-gamma parametrization, and we discuss a single GLM and a DGLM parametrization for Tweedie's approach. Our main results, Theorems 3.6 and 3.8, give conditions under which the different GLM parametrizations lead to identical predictive models. These theorems provide a remarkable property that allows us to lower calibration costs in Tweedie's DGLMs. In Sect. 4 we give insights and intuition based on numerical examples both under GLMs and neural network regression models. In Sect. 5 we conclude, and the "Appendix" gives a short summary of GLMs and describes the data used.

Tweedie's compound Poisson model
In Sect. 2.1 we introduce the compound Poisson model with i.i.d. gamma claim sizes, and in Sect. 2.2 we revisit its Tweedie counterpart. For simplicity, in these two sections, we think of using these models for modeling one single insurance policy only. In Sect. 3, below, we consider multiple insurance policies also allowing for heterogeneity between policies.

Compound Poisson model with i.i.d. gamma claim sizes
Let N be the number of claims and let (Z j ) j≥1 be the corresponding claim sizes. We assume that the number of claims, N, is Poisson distributed with mean w , where > 0 is the expected claim frequency relative to a given exposure w > 0 ; we write N ∼ Poi( w) . We assume that the claim sizes Z j , j ≥ 1 , are i.i.d. and independent of N having a gamma distribution with shape parameter > 0 and scale parameter c > 0 ; we write Z 1 ∼ G( , c) for this gamma distribution. The moment generating function of the gamma claim sizes is given by, see Section 3.2.1 in [20], The compound Poisson model with i.i.d. gamma claim sizes (CPG) is then defined by S = ∑ N j=1 Z j ; we use notation S ∼ CPG( w, , c) . The moment generating function of S is given by we refer to Proposition 2.11 in [20].

Tweedie's compound Poisson model
Following [5,6,17,19] we select a particular model within the EDF. A random variable Y belongs to the EDF if its density has the following form (w.r.t. a -finite measure on ℝ) for r<c.
For properties of the EDF we refer to "Appendix A", below. Tweedie's compound Poisson (CP) model is obtained by choosing for p ∈ (1, 2) the cumulant function We use notation Y ∼ Tweedie( , w, , p) . The first two derivatives of the cumulant function provide the first two moments of Y, see also (A.1) in the "Appendix", Hyper-parameter p ∈ (1, 2) allows us to model the power variance functions V( ) = p between the Poisson boundary case p = 1 and the gamma boundary case p = 2 , we refer to Sect. 3.1, below, for the boundary cases. ↦ = ( � p ) −1 ( ) gives the canonical link of Tweedie's CP model.
We calculate the moment generating function of the exposure scaled Tweedie's CP random variable wY, see also Corollary 7.21 in [20], Note that this is a CPG model in a different parametrization; we call the model under this EDF parametrization Tweedie's CP model. The following proposition follows by comparing the corresponding moment generating functions. Proposition 2.1 Choose S ∼ CPG( w, , c) and Y ∼ Tweedie( , w, , p) . We have identity in distribution S∕w (d) =Y under parameter identification . This implies, using (2.7) in the second step and (2.6) in the last step, The latter says that, of course, the expected claim frequency is obtained by dividing the expected total claim amount [Y] = by the average claim size [Z 1 ] = ∕c.
Thus, under parameter identification scheme (2.6)-(2.8) the two models are identical: This illustrates that there is a one-to-one correspondence between the CPG parametrization and Tweedie's CP parametrization, i.e. the two models are identical and only differ in interpretation of parameters. The next section will demonstrate that these subtle differences can be crucial for GLM regression modeling, and resulting models can be rather different as functions of explanatory covariates, see Sect. 3.3 below.

Generalized linear models and parameter estimation
In this section we study multiple insurance policies i = 1, … , n having claim distributions CPG( i w i , , c i ) and Tweedie( i , w i , i , p) , respectively. We allow for heterogeneity between the policies in all parameters that have a lower index i. We describe modeling and parameter estimation within GLMs: we consider two GLMs to model i (Poisson) and −c i ∕ (gamma) in the former case, and we consider a DGLM to model i and i in the latter case. There is a slight difference between "two GLMs" and "double GLM", the former considers two independent GLMs, the latter does a simultaneous consideration of two GLMs. The volumes w i are assumed to be known and do not need any modeling. The shape parameter > 0 and the power variance parameter p = ( + 2)∕( + 1) , see (2.6), are assumed to be the same for all policies i, this is a standard assumption in state-of-the-art use of these GLMs. An overview of GLMs and their parameter estimation within the EDF is given in "Appendix A".

Compound Poisson model with i.i.d. gamma claim sizes
We begin with the CPG model. Since the log-likelihood function of the CPG model decouples into two separate parts for claim counts and claim sizes, maximum likelihood estimation (MLE) of claim counts and claim size models can be done independently from each other. We start from n independent random variables The joint log-likelihood function of this model, given observations (N i ) i and (Z i,j ) i,j and weights (w i ) i , is given by where the term on the second line is zero for N i = 0 . Remark that in this log-likelihood function (for parameter estimation) we treat (N i ) i and (Z i,j ) i,j as known observations; for notational convenience we do not use small fonts for observations. From (3.1) we now see that we can estimate the Poisson parameters i and the gamma parameters and c i independently from each other; the former uses observations (N i ) and the latter observations Furthermore, we assume that each insurance policy i = 1, … , n is established with covariate information GLM for claim counts: Assume that the expected frequencies i = (x i ) of policies i = 1, … , n can be modeled by a log-linear regression function with regression parameter = ( 0 , … , d ) � ∈ ℝ d+1 . Assuming that the design matrix = (x 1 , … , x n ) � ∈ ℝ n×(d+1) has full rank d + 1 we find the unique MLE ̂ for by the (unique) solution of Remark that the Poisson distribution has an EDF representation with cumulant function (⋅) = 1 (⋅) = exp{⋅} . The lower index p = 1 in the cumulant function 1 (⋅) indicates that we have variance function V( ) = in the Poisson case, see also (2.5). The choice (3.2) corresponds to the canonical link ( � 1 ) −1 (⋅) = log(⋅) in the Poisson GLM. The choice of the canonical link implies that we receive an unbiased portfolio estimate, see [21]. The score Eq. (3.3) is solved numerically, for details see (A.3) in "Appendix A".
GLM for gamma claim sizes: Consider only insurance policies i which have claims, i.e. with N i > 0 . All subsequent considerations in this paragraph are conditional on N i . The average claim amount on policy i has a conditional gamma distribution with shape parameter N i and scale parameter c i N i (note that is not policy i dependent). This gamma distributed random variable has conditional mean and variance given by This model belongs to the EDF (2.2) with cumulant function 2 ( ) = − log(− ) for ∈ = ℝ − , dispersion parameter = 1∕ and exposure w i = N i . The conditional mean and variance are This is the boundary case p = 2 in Tweedie's CP model with power variance function V( ) = 2 , see (2.5).
We set up a second GLM for gamma claim size modeling. This second GLM does not necessarily need to rely on the same covariate space X as the Poisson GLM (3.2) for claim counts modeling. To emphasize this point, we introduce a new covariate space containing covariate information z i = (z i,0 , … , z i,q ) � ∈ Z ⊂ {1} × ℝ q having initial component z i,0 = 1 modeling the intercept. We interpret the choices X and Z as follows: both covariates x i ∈ X and z i ∈ Z should belong to the same insurance policy i, however, inclusion of individual covariate components and pre-processing of these components may differ in the two different regression models. This is a result of aiming at optimizing the predictive performance of both regression models.
We make the following regression assumption: choose a suitable link function g 2 (⋅) to receive the linear predictor, see also "Appendix A", for regression parameter ∈ ℝ q+1 . Formula (3.5) explains the relationship between mean = [Z|N] = � 2 ( ) , canonical parameter and linear predictor = (z) . Usually, one does not select the canonical link in the gamma GLM because the negativity constraint on the canonical parameter ∈ = ℝ − may be too restrictive in choosing a linear functional regression form; this is in contrast to the Poisson GLM (3.2). Therefore, the choice of the link function g 2 (⋅) has to be done carefully, because we require 1∕ i = −g −1 2 ( i ) = −g −1 2 ⟨ , z i ⟩ < 0 for all policies i = 1, … , n , otherwise the canonical parameter i is not in the effective domain . Below, we will choose the log-link for g 2 , which is a common choice for gamma GLMs.
These choices imply for the log-likelihood function, only considering policies The MLE ̂ of is found by solving the score equation, see "Appendix A", and with working residual vector R = (

Remarks 3.1
• Shape parameter may be treated as a hyper-parameter, and the explicit choice of does not influence parameter estimation because it cancels in the score Eq. (3.7). • MLE (3.6)-(3.7) is expressed in sufficient statistics Z i , and we receive the same regression parameter estimate ̂ if we perform MLE directly on the individual claim sizes Z i,j . This is an important property, namely, the gamma GLM can be fit solely on the number of claims N i and the total claim amount Z i on each policy i. Moreover, this estimated model still allows us to simulate individual claim sizes Z i,j . Thus, GLM regression parameter estimation does not differ whether we consider total claim amounts Z i or individual claim sizes Z i,j . On the other hand, the process of model and variable selection might give different results in the two estimation cases ( Z i vs. Z i,j ) because the log-likelihood functions and the estimates for differ, this is, e.g., important for model selection using likelihood ratio tests or Akaike's information criterion, see Remarks 3.10, below. • If we model claim counts and claim sizes separately, we use maximal available information N i and Z i,j . Moreover, we can design covariate spaces X and Z in an optimal way, and independently from each other. • If (3.5) is not based on the canonical link of the gamma model, the balance property will not be fulfilled, see [22]. This should be corrected by shifting the intercept parameter 0 correspondingly. Often one chooses the log-link for g 2 (⋅) , under the log-link choice we can also reformulate the regression problem by replacing the average claim amount response (3.4) by the (conditional) total claim amount S i | {N i } and treating log(N i ) as a known offset in the linear predictor. • Shape parameter < 1 leads to an over-dispersed model with strictly decreasing density, and for > 1 the density is uni-modal. Above is treated as a hyperparameter, and below we discuss MLE of .
• If shape parameter i needs explicit modeling as a function of i, then (3.7)-(3.8) will no longer have such a simple structure, and MLE of will depend on the explicit choices of i . In this case, one can either use a gamma DGLM or one can rely on the 2-dimensional exponential family. The latter model is less tractable numerically. It considers cumulant function ( 1 , 2 ) = log Γ( 2 ) − 2 log(− 1 ) for scale parameter c = − 1 > 0 and shape parameter = 2 > 0 . This gives inverse link function, see [21], the first component being the mean of the gamma distributed random variable Z, and the second component being the mean of log(Z) . We do not further follow up this approach because we would lose the connection to Tweedie's CP approach with a policy independent power variance parameter, see next section.
There remains estimation of shape parameter for given MLE ̂ . One could either use Pearson's dispersion estimate for 1∕ or directly calculate the MLE of . In view of (3.6), the MLE is obtained from score equation Either we solve this score equation numerically using the Newton-Raphson algorithm, or we plot the one-dimensional log-likelihood function ↦ (̂ , ) and determine the MLE ̂ from this plot, see Fig. 1, below, for an example.
We conclude by calculating Fisher's information matrix for ( , ) in our gamma GLM. We have, see "Appendix A", For the second derivative of the term we have where the second order derivative � (x) = d 2 dx 2 log Γ(x) of the log-gamma function is known as the trigamma function, see [10,Sec. 5.15]. The trigamma function is directly available in the statistical software R [13]. For the off-diagonal terms we have This gives us the following Fisher's information matrix for the gamma claim size modeling 1 3

Homogeneous dispersion case
From Sect. 2.2 we know that Tweedie's CP model belongs to the EDF, thus, GLM is straightforward. In this subsection we start with the homogeneous dispersion parameter > 0 case; this case will not be supported in Remarks 3.2, below. We assume having n independent random variables Y i ∼ Tweedie( i , w i , , p) , and we choose hyper-parameter p = ( + 2)∕( + 1) ∈ (1, 2) to make Tweedie's CP model consistent with the CPG case, see Proposition 2.1. Choosing a suitable link function g p (⋅) we make the following regression assumption for the linear predictor where x * i ∈ X * ⊂ {1} × ℝ d * are the covariates of policy i and * is the regression parameter. We change the covariate notation compared to Sect. 3.1 because covariate pre-processing might be done differently for Tweedie's CP model compared to the CPG case (because we consider different responses). In complete analogy with the above, MLE requires solving the score equations with design matrix = (x * 1 , … , x * n ) � , diagonal working weight matrix (using V( i ) = p i ) and working residual vector R = (

Remarks 3.2
There are a couple of crucial differences between Tweedie's CP approach with homogeneous dispersion and the CPG approach of the previous section: 1. The CPG approach of the previous section uses all available information of claim counts N i and claim sizes Z i , whereas Tweedie's CP approach with homogeneous dispersion parameter only uses total claim cost information Y i . 2. The former approach allows us to consider different covariate spaces X and Z for claim counts and claim size modeling, whereas the latter approach only relies on one version of the covariate space X * . 3. The mean estimates ̂ î i in the CPG case do not rely on the particular choice of the shape parameter , whereas in the homogeneous dispersion Tweedie's CP approach the mean estimates ̂ i rely on the specific choice of power variance parameter p = ( + 2)∕( + 1) through the working weight matrix W p , see (3.12). 4. In general, the dispersions resulting from CPG( i w i , , c i ) are not constant: The dispersion can only be constant if i = − i ∕c i does not depend on i. Typically, this is not the case, see also Conclusions and Remarks 3.9, below. Therefore, we need to extend the homogeneous dispersion case of Tweedie's CP model to a DGLM Tweedie's CP model, otherwise it cannot be compared to the CPG case, which is more flexible in dispersion modeling. For more analysis of the homogeneous dispersion case see [12].

Heterogeneous dispersion case
As stated in Remarks 3.2, the homogeneous dispersion Tweedie's CP approach does not use full information of claim counts and claim costs and it does not allow for flexible dispersion modeling i . In Section 2 of [17], the authors raise the point that in applications of Tweedie's CP model to insurance claim data it is important to use full information so that also the dispersion parameter i is modeled flexibly. As a consequence, the dispersion parameter cannot be factored out as in (3.12), and it does not cancel in optimization (3.11). Therefore, [17] propose to use the framework of DGLMs which was introduced and developed by [8,15,18]. DGLMs allow for simultaneous modeling of both mean and dispersion parameters by using a second GLM for the dispersion parameter i . The two GLMs are jointly calibrated using claim count and claim cost information. The joint density of a single case (N, Y) has been derived in formula (11) of [6]:

3), and
If we re-parametrize this joint distribution using mean parameter = � p ( ) = ((1 − p) ) 1∕(1−p) for total claim costs we arrive at the log-likelihood function In complete analogy with the above we determine the score equations w.r.t. and with variance function V( ) = p .

Proposition 3.3 Fisher's information contribution in the heterogeneous dispersion Tweedie's CP model w.r.t. ( , ) is given by Moreover, we have
Remark that in the above proposition we talk about Fisher's information contribution because the statement considers only one single random variable (Y, N). This is in contrast to (3.27) where we calculate Fisher's information matrix over the entire portfolio.
Joint MLE of and requires solving score Eqs. (3.14)- (3.15). This can be done by any suitable root search or gradient descent algorithm. In [17], this root search problem is approached using a slightly different representation, namely, by introducing a dispersion response variable D. This allows for a reformulation of the model in a DGLM form. We revisit [17] after proving Proposition 3.3.

Proof of Proposition 3.3
We start by calculating the means of the terms of the score in (3.15). We have and for the second term we receive

3
Making Tweedie's compound Poisson model more accessible From these two formulas it follows that, indeed, the score in (3.15) is a residual with mean zero. The cross-covariance terms are easily obtained by noting that also the score in (3.14) is a zero mean residual. This implies There remain the diagonal terms. For the first one we have, using integration by parts, For the second diagonal term we have, this provides the variance of the zero mean score in (3.14), This finishes the proof of Proposition 3.3. ◻ Thus, for MLE of and we need to consider the scores in (3.14)-(3.15), the latter one defining (unscaled) residuals w.r.t. the dispersion given by As mentioned above, solving score Eqs. (3.14)-(3.15) produce the MLEs for and ; basically, this finishes the MLE problem. In the remainder of this section, following [17], we rewrite this MLE problem. This different representation introduces a new (dispersion) response variable D, such that the root search problem can directly be related to Fisher's scoring method in a DGLM form. Choose square variance function V d ( ) = 2 and dispersion-prior weights This allows us to define so-called dispersion responses Fisher's information contribution (3.16) then reads as As emphasized by [16], orthogonality of and ( , p) , see (3.17), typically leads to fast convergence in estimation algorithms.

Remarks 3.4
• We start from the joint distribution of (N, Y), given in (3.13), for estimating ( , ) . This estimation problem is modified by considering a new response vector (Y, D), instead. The new dispersion response D, defined in (3.19), is not gamma distributed, but in view of score (3.20) we bring it into a gamma EDF structure with weight v > 0 , dispersion parameter 2 and square variance function V d ( ) = 2 , see also (2.5). In [17] it is mentioned that these definitions of v and D are somewhat artificial, but they bring this estimation problem into a DGLM form; note that this requires to include one dispersion term into the weight v and the response D, this means that we have an approximate score equation equivalence with a gamma MLE problem. In view of Proposition 3.3, we could also define dispersion response D differently by choosing an inverse Gaussian power variance function, i.e. V d ( ) = 3 , and defining the dispersion-prior weight correspondingly. This provides the same numerical solution for MLE, using an approximate score equation equivalence with an inverse Gaussian MLE problem. However, in this latter version the weights do not provide the right scaling for a distribution within the EDF. • Alternatively, we could try to estimate dispersion using Tweedie's deviance residuals Following [17], the squared residuals E 2 are approximately 2 1 distributed for sufficiently small, thus, they can be approximated by a gamma distribution with mean and variance 2 2 . Section 3.1 of [17] discusses this estimation approach. We do not further follow these lines because this approach does not use any claim count information and, therefore, does not benefit from full information (N, Y) as the CPG case. • There is a third alternative of including a dispersion estimation, and this third one is the one implemented in the R package dglm. This requires that the dispersion parameter is made policy dependent and then a DGLM is explored on (Y, E) by alternating the corresponding score updates. Also this approach does not benefit from full information (N, Y) (in contrast to the CPG model), and it is therefore not further explored in this manuscript.

Double generalized linear model in the heterogeneous Tweedie case
We use the heterogeneous dispersion Tweedie's CP approach and bring it into a DGLM form as described in the previous section. Choosing a suitable link function g p (⋅) we make the following regression assumption for the linear predictor of the mean upper indices * distinguishing the parametrization in Tweedie's CP GLM case from the individual models in Sect. 3.1. For the modeling of the dispersion parameter we choose a second link function g d (⋅) such that we have the linear predictor but still belong to the same policy i. MLE of ( * , * ) requires solving the score equations, see (3.14) and (3.20), For the definition of the dispersion-prior weights v i = v i ( i ) and the dispersion responses D i we refer to (3.18)- (3.19). Using Fisher's scoring method for estimating * and * , see "Appendix A", we explore the scoring updates where all terms on the right-hand side are evaluated at algorithmic time t, that is, . This also indicates how the two sets of parameters interact. Since parameters * and * are orthogonal, alternating the updates leads to fast convergence. Standard errors are obtained from the inverse of Fisher's information matrix There remains estimation of p. This is usually done by considering the profile log-likelihood for p, given optimal estimates of ( * , * ) , that is, we study p ↦ (̂ * (p),̂ * (p), p) where, in general, the MLEs ̂ * (p) and ̂ * (p) depend on the explicit choice of the power variance parameter p; for an example of a profile loglikelihood we refer to Fig. 1, below.

Remarks 3.5
• We emphasize that covariates may be chosen and pre-processed differently in the CPG and in Tweedie's CP models; this is indicated by choosing different notation for the covariate spaces (X, Z) and (X * , Z * ) , respectively. Different pre-processing of covariates might be necessary because we aim at optimally modeling different responses in the two models. This optimal modeling also includes good choices of link functions which may even imply that a CPG GLM does not lead to a Tweedie CP DGLM counterpart (or vice versa) because the linear predictor structure does not necessarily carry through general choices of link functions. In Sect. 3.3 we fully rely on log-links which allow for a one-to-one identification scheme between the different GLM frameworks. • The calculation of the terms of Fisher's information matrix involving p are a bit cumbersome, for this reason we do not give them explicitly. • As usual in MLE, typically, the dispersion parameters i will be under-estimated because MLE is not unbiased for variance parameter estimation, we refer to [17], Sects. 3.2 and 4.3. Using both total claim costs Y and claim counts N, the bias is often small, see [17].
We close this subsection by considering the special case of log-links for g p and g d . This special choice provides working weight matrices W p and W d • Theorem 3.6 is a very useful and strong result. In general, we have to run Fisher's scoring method for every power variance parameter p ∈ (1, 2) to find optimal MLEs ̂ * (p) and ̂ * (p) . In a second step, the optimal power variance parameter is found by considering the profile log-likelihood in p. Under the assumptions of Theorem 3.6 we only need to run Fisher's scoring method once to receive MLEs ̂ * and ̂ * (p) for a fixed power variance parameter p. All dispersion estimates for different power variance parameters are then directly obtained from Theorem 3.6, and mean parameter estimates do not vary in p. That is, we can directly maximize function q ↦ (̂ i ,̂ i (q), q) where the dispersion ̂ i (q) scales in q according to Theorem 3.6. • Theorem 3.6 also highlights that the heterogeneous dispersion case is fundamentally different from the homogeneous one. The mean estimates in the homogeneous case depend on the choice of the power variance parameter p through the working weight matrix W p in (3.12). In contrast to the heterogeneous dispersion case, a constant dispersion parameter does not leave any room to balance different p's through portfolio varying dispersions. On the other hand, under the assumptions of Theorem 3.6, the mean estimates are not p sensitive, which is equivalent to the CPG case.
Proof of Theorem 3. 6 The score equations for * and * are under log-link choices provide, see (3.14)-(3.15), for all insurance policies i.
Assume that i = i (p) and i = i (p) solve the above score equations for given power variance parameter p ∈ (1, 2) . Next, we choose power variance parameter q ≠ p , and define ̃ i = k i p−q i for some k > 0 . We plug i and ̃ i into the first score equation for power variance parameter q thus, the pairs ( i ,̃ i ) fulfill the first score equation. We now need to massage these pairs through the second score equation for power variance parameter q Next we apply that the pairs ( i , i ) solve the score equations for p. This provides for the score function of * Now we still have one parameter k > 0 that we can choose. We require This choice implies that (3.30) is equal to zero which follows from the fact that the pairs ( i , i ) solve the score equations for * = * (p) . This finishes the proof. In Remarks 3.10 we give a shorter proof. ◻

Relation between the two GLM approaches
We compare the CPG model to its counterpart being parametrized through Tweedie's CP model. To start off, recall formulas (2.6)-(2.9). The first formula gives relationship p = ( + 2)∕( + 1) ∈ (1, 2) . Since these two parameters are not modeled insurance policy dependent, we directly identify them. We start with the gamma claim size GLM of Sect. 3.1 using identification (2.7). The means are given by, see where we have used canonical link = ( � p ) −1 ( ) = − 1−p ∕(p − 1) . From identification (2.9) we have From identities (3.31)-(3.32) we conclude that for general link functions it is nontrivial to derive one parametrization from the other, i.e. this requires quite some feature engineering to bring the models in line (if possible at all). If we choose loglinks for g 2 , g p and g d (these are not the canonical links in all three cases but they are convenient because they preserve the right sign convention on the canonical scale) we can directly compare the linear predictors Formulating this differently gives us the following theorem.

3
of regression parameters and . The same holds true if we exchange the roles of the two models. • From the second identity of Theorem 3.8 we see that dispersion i is constant over all policies i if and only if the upper indices (−0) indicate that we exclude the intercept components x i,0 = z i,0 = 1 in these scalar products. Identity (3.33) gives the condition under which the assumptions of Sect. 3.2.1 are justified. However, in many practical insurance pricing examples, we find that the covariate space Z for claim sizes is strictly smaller than X used for claim counts modeling because certain factors only influence claim frequencies but are not significant for claim severities. In addition, often there are covariates that have opposite signs for claim counts and claim sizes. In all these cases (3.33) is not satisfied, and working under a constant dispersion assumption cannot be justified. • Assuming covariate relationship x * i = z * i we can re-parametrize the first log-likelihood (3.34) by setting + = (2 − p) * − * and + = −(1 − p) * + * , this gives us (we drop irrelevant terms) This proves under x * i = z * i = x i ∪ z i that the CPG model is nested in Tweedie's CP model and we have for given p = ( + 2)∕( + 1) this explicitly uses that we have the same data representation (N i , Y i ) i in both log-likelihoods.

Remarks 3.10
• Under the assumptions of Theorem 3.8 and additionally assuming that we receive an identity in (3.36). Since the mean estimates in the CPG case do not depend on the particular choice of the shape parameter , the same must hold true for Tweedie's CP DGLM model under identical covariates x i = z i = x * i = z * i . Using Proposition 2.1 we then receive the dispersion scaling of Theorem 3.6, thus, this gives us a second shorter proof for Theorem 3.6.
• If x * i = z * i = x i ∪ z i and x i ≠ z i , the CPG model is strictly nested in Tweedie's CP model and, in general, we do not get an identity in (3.36). In that case, Theorem 3.8 reflects an ideal world because noise in the data prevents MLE estimated parameters (estimated separately in both models) from strictly satisfying the identities in Theorem 3.8. • To perform model selection in the general case we can use Akaike's information criterion (AIC) [1]. This corrects both sides of (3.36) by the number of regression parameters involved, thus, with AIC the model with the smaller value should be preferred from either AIC applies because in both models we use the same data representation (N i , Y i ) i and both models are evaluated in the MLEs for the corresponding parameters. We emphasize that for estimating the CPG GLM we use in (3.36) sufficient statistics Y i = N iZi ∕w i . If, instead, we use the individual claim sizes Z i,j to estimate the CPG GLM, AIC does not apply because the log-likelihoods to be compared use the available data in different ways.

Numerical examples
We study two numerical examples to benchmark the two modeling approaches of Theorem 3.8. First, we design a synthetic data example that fully meets the assumptions of Theorem 3.8. Thus, there is no model uncertainty involved in this first (synthetic) example about underlying distributions, covariate spaces and link functions, and we can fully focus on estimating parameters with MLE in the CPG GLM and in Tweedie's CP DGLM. These results are then compared to neural network regression approaches on the same synthetic data. In contrast to GLMs, neural networks explore optimal covariate selection themselves. This is done in Sect. 4.2.2. Our second example in Sect. 4.3 is a real data example. This additionally raises the issue of model uncertainty because the real data has not been generated by a CPG model. Both examples are based on the motorcycle insurance data swmotorcycle used in [11], this data is available through the R package CASdatasets [3], see Listing 1 for an excerpt of the data. For the synthetic data we sample a portfolio of covariates from the original data, and then generate claims with a CPG GLM designed according to the assumptions of Theorem 3.8. For the real data example we fully rely on the swmotorcycle data and we use the corresponding claim observations.

Description of motorcyle data
We briefly describe the data, for more information we refer to "Appendix B", below. The data comprises comprehensive insurance for motorcycles which covers loss or damage of motorcycles other than collision, for instance, caused by theft, fire or vandalism. The data is aggregated on insurance policy level for years 1994-1998. The data is shown in Listing 1. We have applied some pre-processing, e.g., we have dropped all policies that have an exposure equal to zero.  The data is illustrated in "Appendix B".

Synthetic data example
This section is based on synthetic (simulated) data from a CPG GLM.

A generalized linear model approach
We start by describing the simulation of the synthetic data. We randomly choose n = 250 � 000 insurance policies from dat=swmotorcycle using the R code: Based on this portfolio we generate claims (N, Y) using two GLMs that fulfill the CPG assumptions of Theorem 3.8, the modeling details are specified in columns 1-3 of Table 1. We especially emphasize that the covariate spaces X and Z differ for claim counts and claim sizes.
CPG GLM We estimate the Poisson claim counts GLM and the gamma claim amounts GLM separately, according to Sect. 3.1 and under log-link choices. The results are presented in column 'estimated CPG' of Table 1, the brackets provide one estimated standard deviation received from the inverse of Fisher's information matrix. Note that we can estimate all regression parameters k and k without Shape param.  Fig. 1 (lhs) Log-likelihood ↦ (̂ , ) of the gamma GLM to estimate shape parameter for given ̂ ; (rhs) Tweedie profile log-likelihood p ↦ (̂ * ,̂ * (p), p) to estimate p specifying shape parameter > 0 explicitly. Most estimated parameters are within one standard deviation of the true parameter values. The true parameters have been chosen such that they resemble the true data swmotorcycle. The true data has an observed claim frequency of only 1.05%, see "Appendix B". In the present example, claims are scarce too, and the gamma claim size GLM has been estimated on (only) 2'795 claims. The parameter estimates are remarkably accurate (we do not have model uncertainty here, only parameter estimation uncertainty). We conclude that this model can be calibrated well using the separate approach for claim counts and claim amounts. Figure 1 (lhs) considers the log-likelihood function ↦ (̂ , ) of the gamma GLM to estimate shape parameter , we also refer to score Eq. (3.9). From this we find MLE ̂ = 1.56 , and the inverse of Fisher's information matrix provides an estimated standard deviation of 0.04 for this estimate. Thus, the estimated shape parameter is slightly too high, though still within two standard deviations of the true value of = 1.5 . We again highlight that this estimate is based on only 2'795 claims. Moreover, we remark that ̂ has been used in the standard deviation estimates of Table 1, see (3.8).
Tweedie's DGLM Next we turn our attention to Tweedie's CP case. The true values , and as well as their MLE counterparts ̂ , ̂ and ̂ from the CPG model are transformed with Theorem 3.8 to receive the same model in Tweedie's CP parametrization, this is illustrated in the first four columns of Table 2. In a first calibration step for Tweedie's CP model, we choose p = 1.39 which is the optimal power variance parameter estimate of the CPG model, see last line in column 4 of Table 2. We then calibrate Tweedie's CP DGLM model for this power variance parameter p with Fisher's scoring method (3.25)-(3.26); as starting values for the algorithm we use the estimates from the CPG model (in italic in Table 2). Fisher's scoring method converges in 7 iterations with these initial values. Due to (3.36) we receive a model that has a bigger log-likelihood than its CPG counterpart (we include all constants in this consideration so that the log-likelihoods are directly comparable).
In the next step, we optimize over the power variance parameter p. Therefore, we use Theorem 3.6, which says that the mean estimates ̂ i do not depend on p, and which provides the p-scaling for dispersion parameter MLEs ̂ i (p) . This allows us to directly plot the profile log-likelihood p ↦ (̂ * ,̂ * (p), p) as a function of p ∈ (1.36, 1.41) , see Fig. 1 (rhs). From this figure, we find maximizing value p = 1.39 , which is close to the true value of p = 1.4 . The second last column in Table 2 shows the resulting MLEs ̂ * and ̂ * (p) of the optimal Tweedie's CP model.
A first observation is that the parameter estimates from Tweedie's CP model are not as close to the true values as the MLEs from the CPG model. However, model selection should not be based on this observation: note that the (true) CPG model has 22 parameters and Tweedie's CP model has 33 parameters, therefore, we expect some differences in model calibration.
We summarize the two estimated models in Table 3. On row (a) we compare the log-likelihoods CPG (̂ ,̂ ,p) and Tw (̂ * ,̂ * ,p) of the estimated models CPG and Tweedie's CP, see also (3.36), to the one of the true model ( * , * , p) : we observe that both models slightly overfit to the data, with Tweedie's CP model having a slightly larger overfit [this is consistent with (3.36)]. Therefore, we penalize in AIC the log-likelihoods of the models by the number of parameters involved, see (3.37).
The AIC values are given on row (b) of Table 3, and we give preference to the CPG calibration. Performing a likelihood-ratio test having the CPG model as null    Table 3 gives the rooted mean square error (RMSE) between the true model means w i i and their estimated counterparts w î i = w î î i ; rows (d)-(g) show average means and dispersions as well as the corresponding standard deviations. We observe that these figures match the true values quite well. Recall that these figures are based on one simulation from the true model for each insurance policy, thus, they involve simulation error (but they do not involve model error because we only assume parameters , and p as unknown in this example). Moreover, we remark that the dispersion is not under-estimated, here, we also refer to the last bullet point of Remark 3.5. Finally, in Fig. 2 we plot the predicted means ̂ i against the true values i . The left-hand side compares the two estimated models against the true model, and the right-hand side compares the two estimated models against each other. From these plots we conclude that both models are very accurate, the CPG estimated one (orange) being slightly closer to the true model than its Tweedie's CP counterpart (green). Summarizing: This synthetic example gives evidence supporting industry practice on focusing on the CPG model. Specifying covariate spaces is easier in the CPG case because systematic effects of claim counts and claim amounts are clearly separated, and in our example accuracy is slightly higher because Tweedie's CP seems to slightly overfit in our example.

A neural network regression approach
Next we explore neural network regression models on the same synthetic data. Neural networks have the capability of representation learning which means that they can perform covariate engineering themselves, we refer to Sections 4 and 5 of [21]. Therefore, covariates can be provided in their raw form to neural networks. The neural networks then, at the same time, pre-process these covariates and predict the response variables. Starting from a GLM, the required changes to achieve this representation learning are comparably small. We illustrate this in the present section. Alternatively, one may also be interested in using generalized additive models (GAMs). GAMs are more flexible in modeling different functional forms in the components of the covariates compared to GLMs, however, they do not automatically allow for flexible interaction modeling between covariate components. For this reason, we favor neural networks over GAMs. We first define the (raw) covariate space X † which is going to be used throughout this section: where we use dummy coding for the categorical variable ∈ {0, 1} 4 . In contrast to Table 1, we do not specify the continuous variables in all its functional forms, but we let the neural network find these functional forms. A neural network is a function that consists of a composition of a fixed number of hidden network layers, each of them having a certain number of hidden neurons. For an explicit mathematical definition we refer to Section 3.1 in [21]. x † has the interpretation of being the raw covariate, and x = (x † ) ∈ ℝ d can be interpreted as the (network) pre-processed covariate. These pre-processed covariates (x † ) are then used in a classical GLM, e.g., for claim counts we may set for the log-link choice, see (3.2), note that we use a slight abuse of notation here because strictly speaking (x † ) does not include an intercept term for 0 , so this always needs to be added. Neural 3) involves regression parameters ∈ ℝ d+1 as well as network weights ∈ ℝ r which parametrize network function = . The dimension r of depends on the complexity of the chosen network . Network fitting now trains at the same time network parameter for an optimal covariate pre-processing as well as GLM parameter for response prediction. State-of-the-art fitting uses variants of the gradient descent algorithm, and a good performance depends on the complexity of , we just mention the universal approximation property of appropriately designed neural networks. For more information, we refer to the relevant literature, in particular, to [21]. Based on this reference we explore (4.3) and its counterparts for claim counts and Tweedie's CP model. In all three prediction problems we use the identical covariate space X † , and only network function will differ in the weights to bring covariates into the appropriate form for the corresponding prediction task.
Poisson claim counts We start by modeling claim counts using neural network approach (4.3). We use the R library keras to implement this, and we use exactly the same architecture as in Listing 4 of [14], the only thing that changes is the dimension of X † from 40 on line 1 of Listing 4 in [14] to 8 in the present example, see (4.1). This results in r = 655 and d + 1 = 11 parameters. We fit these parameters in the usual way by considering 80% of the data for training and 20% of the data for out-of-sample validation to track overfitting in the gradient descent algorithm (we run 100 epochs on batch size 5000). We then choose the parameter that has the best out-of-sample performance on the validation data. To this network solution we apply the bias regularization step of Listing 5 in [21] to make the model unbiased.
On rows (a1)-(a2) of Table 4 we present the results for the claim counts neural network model. We provide the Poisson deviance losses of the true model i (which is known here because we simulate from this model), the intercept model that does not use covariate information (i.e. is only based on intercept parameter 0 ), the claim counts GLM (upper part of Table 1) and its neural network counterpart. We observe that both regression models slightly overfit to the data 8.4366 ⋅ 10 −2 and 8.4393 ⋅ 10 −2 , respectively, compared to the true model loss of 8.4431 ⋅ 10 −2 .
On row (a2) we provide the RMSE between the true model means i and the estimated ones ̂ i . We note that the Poisson GLM has a smaller RMSE than the neural network Poisson regression model. This is not surprising because the Poisson GLM uses the right functional form (no model uncertainty) and only estimates regression parameter whereas the neural network regression model also determines this functional form for the raw covariates x † . In Fig. 3 (lhs) we compare the resulting estimated frequencies to the true ones on all individual insurance policies i = 1, … , n . From this plot we conclude that both models do a fairly good job because the dots lie more or less on the diagonal (which reflects the perfect model).
Gamma claim sizes Next we consider a neural network approach for the gamma claim sizes. This essentially means that we replace linear predictor (3.5) by the following neural network predictor (under a log-link choice for g 2 ) where is a neural network function (4.2) that may have the same structure as the one used for the Poisson regression model (4.3), but typically differs in network weights . For simplicity, we use exactly the same neural network architecture as in the Poisson case, only the exposure offset is dropped and the Poisson deviance loss function is changed to the gamma deviance loss function (including weights), in line with the distributional assumptions made.
The results are presented on rows (b1)-(b2) of Table 4 and Fig. 3 (rhs) (we run 1000 epochs on batch size 5000 and we callback the model with the smallest validation loss). Again we receive reasonably good results from the network approach, i.e., covariate engineering on X † is done quite well by the network, we emphasize that these results are based on only 2'795 claims. But we also see from Fig. 3 (rhs) that individual predictions spread more around the diagonal than in the gamma GLM case (where we assume perfect knowledge about the functional form of the (4.4)  Better accuracy can only be achieved by having more claim observations. Next, we estimate the shape parameter . This is done analogously to the gamma GLM case by plotting the corresponding log-likelihood ( ) as a function of . This gives estimate ̂ = 1.57 , which is slightly too large but still reasonable compared to the true value of = 1.5 . A too high shape parameter implies a too low dispersion, which is a sign of over-fitting to the observations.
We conclude with the summary statistics for the neural network approaches in Table 5 column 'estimated CPG', which look fairly similar to the GLM ones in Table 3. We obtain a larger RMSE, which is not surprising because we have more model uncertainty due to missing covariate knowledge, this is also obvious from Fig. 3.
Tweedie's compound Poisson neural network approach First, we remark that, in general, there is no simple comparison between a CPG and a Tweedie CP neural network approach similar to (3.34)-(3.35). The relation (3.34)-(3.35) is strongly based on the fact that we can directly compare linear predictors under suitable choices of covariate spaces. Since the networks given (4.2) transform covariates in a non-trivial way under non-linear activation functions, there is no hope to get an easy comparison between the models unless the network architectures are chosen in a very specific way, i.e. artificial way, so to say. Therefore, we do not aim to nest the CPG neural network into Tweedie's CP neural network model, but we directly focus on modeling the latter. This essentially implies that we have to replace linear predictors (3.21)-(3.22) by the following two-dimensional neural network predictors (under log-link choices for g p and g d ) where is a neural network function (4.2). The first component of ( , )(x † ) ∈ ℝ 2 + predicts the total claim costs Y and the second component estimates the dispersion parameter . We use one network to simultaneously perform this prediction task for mean and dispersion parameter. We implement this in the R library keras and we use the same architecture as in Listing 4 of [14], but we need to change the This requires a custom made loss function in keras for parameter estimation, the details are provided in Listing 2 in the "Appendix". We fit this model with the gradient descent algorithm exactly using the same methodology as outlined above (callback of the lowest validation loss model after 100 epochs on batch sizes 5000). In order to come up with the optimal neural network model we need to fit neural networks for multiple power variance parameters p, because there is no result similar to Theorem 3.6 that allows for a shortcut. Of course, this disadvantages Tweedie's CP neural network model from a computational point of view. We come up with an optimal power variance parameter estimate of p = 1.390 , which yields then the results in the last column of Table 5. From the figures on rows (c)-(g) we conclude that Tweedie's CP approach is not fully competitive with the CPG fitting. These differences are also illustrated in Fig. 4 with the CPG approach being slightly closer to the true model means. Nevertheless, all these estimates look very reasonable and the estimated neural network seems to capture the crucial features of the true model.
Conclusions from our synthetic data example Our findings support industry practice of focusing on the CPG parametrization. Our estimated models based on this parametrization are closer to the true model than the ones obtained from Tweedie's CP parametrization. If we work under GLM assumptions we need to pre-process covariates which is easier in the CPG parametrization because systematic effects of claim counts and claim amounts can be separated. If we work under neural network regression models, model calibration is not efficient under Tweedie's CP parametrization because we need to run gradient descent algorithms on multiple power variance parameters p to find the optimal model. Moreover, in our example, the CPG case leads to more accurate predictive models.

Real data example: an outlook
In view of the previous example everything seems to be fairly clear. However, our synthetic data is based on the very strong property of having gamma claim sizes with constant shape parameter over the whole insurance portfolio. This assumption may be critical in real insurance applications. We briefly analyze it in terms of our real data example given in "Appendix B", and we give an outlook in case this assumption is not fulfilled. We keep this section very short, and we mainly view it as a motivation to conduct future research in this direction. There are two possibilities in which the constant shape parameter assumption may fail, either the claim sizes are gamma distributed, but the shape parameter i is also insurance policy i dependent, or the gamma distribution is inappropriate due to that the claim sizes exhibit too heavy tails. We explore this on the real data example provided in "Appendix B". For this it suffices to focus on the gamma claim size model, i.e. we do not study claim counts in this real data example. Moreover, to minimize covariate pre-processing we explore a gamma neural network regression model on these claim sizes, the chosen model architecture is identical to the one used in (4.4), in particular, it does covariate engineering itself. Table 6 shows the (in-sample) gamma deviance losses of the intercept model and the neural network regression model. Obviously, the neural network approach has a better performance (note that the network model has been received by a proper training-validation analysis as described above). Using the resulting mean estimates ̂ i we can estimate the (constant) shape parameter . This is illustrated in Fig. 5 (lhs): we estimate ̂ = 0.75 . Thus, we receive a shape parameter smaller than 1, which provides over-dispersion 1∕� = 1.33 > 1 , i.e., the estimated gamma densities are strictly decreasing. This fact requires further examination because there might be two situations: either the true shape parameter is smaller than 1 (and everything is fine), or the claim sizes are more heavy tailed than a gamma distribution allows. This is typically compensated by over-dispersion in the estimated model. We analyze this warning signal on our real data. Figure 5 (rhs) gives the Tukey-Anscombe plot of the gamma deviance residuals against the fitted means. This plot supports the model choice because we cannot see any particular structure in the figure, it also supports the constant shape parameter assumption on . Figure 6 gives the QQ-plot and it compares the observed claims against one simulation from the fitted model. Also these two plots look quite reasonable, one may only question the upper tail of the QQ-plot.
Conclusions The short analysis on the real data has shown that for the motorcycle claims data the gamma claim size model is fairly reasonable, thus, supporting the CPG model. On different data, one may relax the constant shape parameter assumption on . This may result in a DGLM for gamma claim sizes (which is known in industry) and a Poisson GLM for claim counts. Again this model can easily be fitted in the Poisson-gamma parametrization, however, this approach does not have a Tweedie's CP counterpart relying on a fixed parameter p, giving more support to the industry preference of choosing the Poisson-gamma parametrization.

Conclusion
We have revisited the compound Poisson model with i.i.d. gamma claim sizes. This model allows for two different parametrizations, namely, the Poisson-gamma parametrization and Tweedie's compound Poisson parametrization. We have provided results for GLMs illustrating when the two parametrizations are identical, and we have provided a theorem that allows for efficient fitting of power variance parameters in Tweedie's parametrization (under log-link choices for the GLMs).
In the applied section, we have analyzed why the insurance industry gives preference to the Poisson-gamma parametrization. Based on examples, we find that, indeed, this parametrization is easier to fit, and results turn out to be more accurate in our examples. In particular, under neural network regression models we give a clear preference to the Poisson-gamma parametrization because Tweedie's version does not possess an easy and efficient way in estimating the power variance parameter. That is, the Tweedie version is computationally clearly lacking behind the Poisson-gamma case.
For our real data example it turns out that the gamma claim size model with constant shape parameter is quite reasonable. However, in many other applications this is not the case. Therefore, insurance industry explores double GLMs for a flexible modeling of shape parameters of claim sizes; on the other hand, a case-dependent p modeling in Tweedie's compound Poisson parametrization is not (easily) feasible. For modeling more heavy tailed claim sizes, mixture models are a promising proposal.

A Generalized linear models
GLMs have been introduced in [9], and they have been studied in the monograph [7]. GLMs are based on the EDF (2.2). The EDF has been studied extensively in [2,4,5], and its properties have been revisited in [21]. The original introduction of EDF distributions (2.2) is constructive from which it follows that the effective domain is a convex set and that the cumulant function is a smooth and convex function on the interior of the effective domain ̊ . Moreover, we get the following moments for Y having EDF distribution (2. with design matrix = (x 1 , … , x n ) � ∈ ℝ n×(d+1) . MLE system (A.3) is solved either using Fisher's scoring method or the iteratively re-weighted least squares (IRLS) algorithm, see [7,9]. For Fisher's scoring method we explore the scoring updates where all terms on the right-hand side are evaluated for algorithmic time t. It has been pointed out by an anonymous referee that the R command glm() does not directly calculate the inverse of the matrix ′ W in (A.4), but, instead, solves a linear system for t+1 . The motivation for this approach is that in high-dimensional covariate spaces or in the situation of multiple categorical variables with many labels (implemented by dummy coding), the matrix ′ W may be close to singular and, henceforth, inversion of this matrix may lead to unstable results. Standard errors are obtained from the inverse of Fisher's information matrix where ∇ 2 denotes the Hessian w.r.t. . The IRLS algorithm replaces the inverse Fisher's information matrix I( ) −1 = ( � W ) −1 in the scoring updates by the inverse of the observed information matrix

B Motorcycle data example
We start with a descriptive and exploratory analysis of the Swedish motorcycle data of Listing 1. We have n = 62 � 036 insurance policies with positive exposures w i > 0 . The empirical claim frequency is ̄= ∑ n i=1 N i ∕ ∑ n i=1 w i = 1.05% , and the average claim size is ̄= ∑ n i=1 ∑ N i j=1 Z i,j ∕ ∑ n i=1 N i = 24 � 641 Swedish crowns SEK. Figure 7 shows a boxplot over all exposures w i and the claim counts N i on all insurance policies. We note that insurance claims are rare events for this product, because the claim frequency is only ̄= 1.05%. Figures 8 and 9 give the marginal total exposures (split by gender), the marginal claim frequencies and the marginal average claim amounts for the covariate components Age, Zone, McClass, McAge and Bonus. The first observation is that we have a very imbalanced portfolio between genders, only 11% of the total exposure is coming from females. The empirical claim frequency of females is 0.86% and the one of males is 1.08%. We note that the female claim frequency comes from (only) 61 claims (based on an exposure of female of 7'094 accounting years, versus 57'679 for male). Therefore, it is difficult to analyze females separately, and all marginal claim frequencies and claim sizes in Figs. 8  jointly for both genders. Average claim sizes are 18'237 SEK and 25'270 SEK for female and male, respectively. The empirical marginal frequencies in Figs. 8 and 9 (middle) are complemented with confidence bounds of two standard deviations (blue dotted lines) and the empirical overall frequency ̄= 1.05% (orange color). From the plots we conclude that we should keep the explanatory variables Age, Zone, McClass and McAge, but the variable Bonus does not seem to have any predictive power. At the first sight, this seems surprising because the bonus-malus level encodes the past claims history. The reason that the bonus-malus level is not needed for our claims is that we consider comprehensive insurance for motorcycles covering loss or damage of motorcycles other than collision (for instance, caused by theft, fire or vandalism), and the bonusmalus level encodes collision claims. The situation for average claim amounts is a bit more difficult to understand, but we make a similar conclusion, namely, that we can drop the covariate Bonus. Moreover, we merge Zones 5-7 because of small exposures and similar behavior. Figure 10 shows the correlations between the covariates: (lhs) correlations between continuous covariates, (plots rhs), dependence between continuous covariates and the categorical Zone covariate. We have some dependence, for instance, in Zone 1 (three largest Swedish cities) motorcycles are more light (McClass) and less old. Older people drive less heavy motorcycles that are more old, and older motorcycles are less heavy. Figure 11 gives the empirical density, empirical distribution and log-log plot of average claim amounts Z i . From the log-log plot we conclude that the average claim amounts are not heavy tailed, which does not reject the use of gamma claim size distributions at that stage.