EM estimation for bivariate mixed Poisson INAR(1) claim count regression models with correlated random effects

This article considers bivariate mixed Poisson INAR(1) regression models with correlated random effects for modelling correlations of different signs and magnitude among time series of different types of claim counts. This is the first time that the proposed family of INAR(1) models is used in a statistical or actuarial context. For expository purposes, the bivariate mixed Poisson INAR(1) claim count regression models with correlated Lognormal and Gamma random effects paired via a Gaussian copula are presented as competitive alternatives to the classical bivariate Negative Binomial INAR(1) claim count regression model, which only allows for positive dependence between the time series of claim count responses. Our main achievement is that we develop novel alternative Expectation-Maximization type algorithms for maximum likelihood estimation of the parameters of the models, which are demonstrated to perform satisfactorily when the models are fitted to Local Government Property Insurance Fund data from the state of Wisconsin.


Introduction
Over the past decade, there has been a growing literature on bivariate (and/or multivariate) claim count regression models which can efficiently capture the dependence between claims from the same policy and/or different coverages bundled into a single policy. The interested reader is referred to Refs. [1-7, 9, 11, 13, 15, 23-25, 28, 29], among many others.
Pechon et al. [23] proposed the use of bivariate mixed Poisson count regression models with correlated random effects for capturing the interactions between the different coverages purchased by members of the same household. In particular, Pechon et al. [23] considered the bivariate Poisson-Gamma (BPGGA) regression model with Gaussian copula and the bivariate Poisson-Lognormal (BPLN) regression model. In the former model the random effects are distributed according to two Gamma distributions with unit means and the dependence between them is introduced by means of a bivariate Gaussian copula, whereas in the latter model the random effects follow a bivariate Lognormal mixing distribution. Bermúdez et al. [4], following the setup of Pedeli and Karlis ([26] and [27]), were the first to derive a bivariate Poisson integer-valued autoregressive process of order 1 (BINAR(1)) claim count regression model which can account for both cross-sectional and temporal dependence between multiple claim types. The model they developed was employed for addressing the ratemaking problem of pricing an insurance contract in the case of positively correlated claims from different types of coverage in non-life insurance. Finally, Bermúdez and Karlis [5] built on the previous paper by using a multivariate INAR(1) (MINAR(1)) regression model based on the Sarmanov family of distributions. MINAR(1) regression models based on the Sarmanov family are also restricted to a positive correlation structure between the claim count response variables. However, this approach enjoys some advantages compared to an alternative which can allow for both positive and negative correlations by using copulas for the specification of the joint distribution of the innovations; see, for instance, Refs. [8, 17, 18, 20-22], among others.
Firstly, it avoids identifiability issues which may arise when a continuous copula distribution is paired with discrete marginals; see Ref. [12]. As is well known, lack of identifiability means that the model fit cannot be guaranteed to be unique, which may lead to problems in statistical inference; for example, the standard errors of the parameters may be meaningless. Secondly, the computational burden of discrete copula-based models increases with the dimension of the model and hence, as the authors also mention, their approach, which relies on the Sarmanov family, provides models that are less computationally intensive to estimate and can still accommodate a reasonable range of positive dependence between the claim count responses.
In this study, we introduce a family of bivariate mixed Poisson INAR(1) claim count regression models with correlated random effects for modelling the dependence structure between times series of different types of claim counts from the same and/or different types of coverage. The bivariate mixed Poisson INAR(1) regression models with correlated random effects are a broad class of models which can accommodate overdispersion, which is a direct consequence of unobserved heterogeneity due to systematic effects in the data, and correlations of different signs and magnitude.
For demonstration purposes, we consider the bivariate mixed Poisson INAR(1) claim count regression models derived by using the bivariate Lognormal distribution and a Gaussian copula paired with Gamma marginals as mixing densities, which we refer to as the BINAR(1)-LN and BINAR(1)-GGA claim count regression models respectively. Both models can be regarded as extensions of the classical bivariate Negative Binomial INAR(1) claim count regression model with a shared Gamma random effect, which we refer to as the BINAR(1)-GA claim count regression model. They provide more flexibility for modelling overdispersed bivariate time series of count data, since the BINAR(1)-GA model pre-imposes a restrictive positive correlation assumption between the time series of different types of claim counts, whereas in some applications negative correlations may be of interest as well. Furthermore, unlike previous copula-based count regression models, for which identifiability issues can arise when a continuous copula distribution is paired with discrete marginals, in the proposed family of models identifiability of the bivariate distribution of the innovations is guaranteed by imposing a unit mean constraint on the Gamma continuous mixing densities which are paired with a Gaussian copula.
The main contributions we make are as follows:
• Firstly, before we introduce the time series components, we present a unified framework for statistical inference via the Expectation-Maximization (EM) algorithm for the BPGA, BPLN and BPGGA regression models.
• Secondly, we develop novel EM type algorithms for maximum likelihood (ML) estimation of the BINAR(1)-GA, BINAR(1)-LN and BINAR(1)-GGA regression models, which have not been explored in the literature so far. The main reason for this is that the joint distribution of the innovations cannot be written in closed form in either model and hence its maximization is not possible via standard numerical optimization as is done in Refs. [4, 5, 17, 26, 27].

The rest of the paper is organized as follows. Section 2 presents the model specifications for the bivariate mixed Poisson regression models we consider and describes their ML estimation via the EM algorithm. Section 3 presents the derivation of their INAR(1) extensions, which are first proposed herein, and outlines the EM type algorithms we developed for statistical inference. Section 4 presents our empirical analysis, which is based on the LGPIF dataset. Concluding remarks are given in Sect. 5. The interpretation of abbreviations used in the paper and some other technical details are provided in Appendix A.1.

Model specifications
The bivariate Poisson mixture is constructed from two Poisson random variables that are independent conditional on a random effect vector (or scalar) $\theta = (\theta_1, \theta_2)$, such that $N_t^{(i)} \mid \theta_i \sim \mathrm{Pois}(\lambda_{i,t}\theta_i)$, $i = 1, 2$. The bivariate mixed Poisson regression model is then obtained by further allowing the rates $\lambda_{i,t}$ to be modelled as functions of explanatory variables $z_{i,t}$, such that $\lambda_{i,t} = \exp\{z_{i,t}^T \beta_i\}$. Denote the mixing density function of the random effect by $f_\phi(\theta)$, parametrized by $\phi$. To avoid identifiability issues, we have to restrict the expectation $E[\theta_i]$ to be a fixed constant. One usually sets $E[\theta_i] = 1$, so that $\lambda_t := (\lambda_{1,t}, \lambda_{2,t})$ fully explains the frequency of an event and $\phi$ explains the variation and correlation of the whole bivariate sequence. In the following, we discuss three different mixing densities: univariate Gamma (shared random effect), bivariate Lognormal, and a Gaussian copula paired with Gamma marginals.
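To make the construction concrete, the following stand-alone sketch (with hypothetical covariate and parameter values) simulates from the bivariate mixed Poisson model with bivariate Lognormal random effects of unit mean; a pure-Python Poisson sampler (Knuth's method) is included so no external libraries are needed. A negative $\rho$ induces negative correlation between the two counts:

```python
import math
import random

def rpois(lam, rng):
    # Knuth's product-of-uniforms Poisson sampler; fine for small rates
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_bmp(z1, z2, beta1, beta2, sigma, rho, rng):
    # One draw from N_i | theta_i ~ Pois(lambda_i * theta_i), where
    # theta = exp(eps) and eps is bivariate normal with mu_i = -sigma_i^2/2,
    # so that E[theta_i] = 1 (the unit-mean identifiability constraint)
    lam1 = math.exp(sum(a * b for a, b in zip(z1, beta1)))
    lam2 = math.exp(sum(a * b for a, b in zip(z2, beta2)))
    g1 = rng.gauss(0.0, 1.0)
    g2 = rho * g1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    th1 = math.exp(sigma[0] * g1 - 0.5 * sigma[0] ** 2)
    th2 = math.exp(sigma[1] * g2 - 0.5 * sigma[1] ** 2)
    return rpois(lam1 * th1, rng), rpois(lam2 * th2, rng)

rng = random.Random(1)
draws = [simulate_bmp([1.0, 0.5], [1.0, -0.2], [0.1, 0.3], [0.2, 0.4],
                      (0.6, 0.6), -0.5, rng) for _ in range(20000)]
mean1 = sum(d[0] for d in draws) / len(draws)  # close to exp(z1' beta1)
```

Because of the unit-mean constraint, the sample mean of each count recovers $\exp\{z_{i}^T\beta_i\}$, while the sign of the empirical covariance follows the sign of $\rho$.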

(a) Univariate Gamma density
In this case, the bivariate mixed Poisson regression model shares the same random effect, $\theta_1 = \theta_2 = \theta$. Denote the mixing density function by $f_\phi(\theta) = \frac{\phi^{\phi}}{\Gamma(\phi)}\theta^{\phi-1}e^{-\phi\theta}$, i.e. a Gamma$(\phi, \phi)$ density, which has unit mean and variance $1/\phi$. Then the unconditional probability mass function $f_{PG}(\mathbf{k}, t)$ of $N_t := (N_t^{(1)}, N_t^{(2)})$ can be written down in closed form.

(b) Bivariate Lognormal density
Suppose $\epsilon = (\epsilon_1, \epsilon_2)$ follows a bivariate normal distribution with mean vector $\mu = (-\sigma_1^2/2, -\sigma_2^2/2)$ and covariance matrix $\Sigma$. Then the random effect vector $\theta = e^{\epsilon} = (e^{\epsilon_1}, e^{\epsilon_2})$ has a bivariate Lognormal distribution with unit means. Denote the normal density function by $f_N$ and the Lognormal density by $f_{LN}$. The unconditional distribution $f_{PLN}(\mathbf{k}, t)$ of $N_t$ is expressed as a double integral. All the double integrals with respect to the Lognormal density $f_{LN}$ can be transformed into double integrals with respect to the normal density $f_N$, so that they can be evaluated by Gauss-Hermite quadrature; see details in Appendix A.2.
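For the shared-Gamma case, the closed form referred to above is the standard bivariate Negative Binomial obtained by integrating the product of the two conditional Poisson pmfs against the Gamma$(\phi, \phi)$ density; a sketch consistent with the unit-mean parametrization used here:

```latex
f_{PG}(\mathbf{k},t)
  = \int_0^\infty \prod_{i=1}^{2}
      \frac{e^{-\lambda_{i,t}\theta}(\lambda_{i,t}\theta)^{k_i}}{k_i!}
      \cdot \frac{\phi^{\phi}}{\Gamma(\phi)}\,\theta^{\phi-1}e^{-\phi\theta}\,d\theta
  = \frac{\Gamma(k_1+k_2+\phi)}{k_1!\,k_2!\,\Gamma(\phi)}
    \,\frac{\lambda_{1,t}^{k_1}\,\lambda_{2,t}^{k_2}\,\phi^{\phi}}
          {(\lambda_{1,t}+\lambda_{2,t}+\phi)^{\,k_1+k_2+\phi}}
```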
(c) Gaussian copula paired with Gamma marginals
Suppose now that the random vector $\theta$ follows a meta-Gaussian distribution such that its marginals are two Gamma random variables with parameters $(\phi_1, \phi_2)$ respectively. Define the uniform random vector $u = (F_{\phi_1}(\theta_1), F_{\phi_2}(\theta_2))$. The distribution function $F_{GC}(\theta)$ and density function $f_{GC}(\theta)$ can then be written in terms of $f_\rho(\cdot,\cdot)$ and $F_\rho(\cdot,\cdot)$, the density and cumulative distribution functions of a bivariate normal random variable with correlation $\rho$. Here $\Phi(x)$ is the cdf of a standard normal random variable with $\Phi^{-1}(x)$ as its quantile function, and $f_{sn}(x)$ is the density function of the standard normal random variable. Finally, $f_{\phi_i}(x)$ and $F_{\phi_i}(x)$ are the pdf and cdf of the Gamma density defined in (a), for $i = 1, 2$. A bivariate mixed Poisson random vector is then constructed with this mixing density, and the resulting double integral can be evaluated by Gauss-Legendre quadrature; see details in Appendix A.3.
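In the notation just introduced, the meta-Gaussian mixing distribution factorizes into the Gamma marginals times the Gaussian copula density; a sketch:

```latex
F_{GC}(\theta) = F_\rho\!\big(\Phi^{-1}(u_1),\,\Phi^{-1}(u_2)\big), \qquad
f_{GC}(\theta) = f_{\phi_1}(\theta_1)\,f_{\phi_2}(\theta_2)\,
  \frac{f_\rho\!\big(\Phi^{-1}(u_1),\,\Phi^{-1}(u_2)\big)}
       {f_{sn}\!\big(\Phi^{-1}(u_1)\big)\,f_{sn}\!\big(\Phi^{-1}(u_2)\big)},
\qquad u_i = F_{\phi_i}(\theta_i)
```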

The EM algorithm
For statistical inference in the above models, classical maximum likelihood estimation is not straightforward to apply because the log-likelihood function is not computationally tractable and the maximum likelihood estimators are not straightforward to obtain. Alternatively, we can apply the EM algorithm to estimate the parameters $\Theta = \{\beta_1, \beta_2, \phi\}$. For given random samples $(\mathbf{k}_1, \ldots, \mathbf{k}_n)$, suppose that we also observe the random effects $(\theta_1, \ldots, \theta_n)$; then the complete likelihood function $L_c(\Theta)$ can be written down directly. Compared to $\ell(\Theta)$, the complete log-likelihood function $\ell_c(\Theta)$ is simplified in the sense that there is no integration, and the mixture likelihood decomposes into a Poisson likelihood and the likelihood for the mixing density. However, to evaluate $\ell_c(\Theta)$ we need the conditional (posterior) distribution of $\theta$ given the random samples. Define $\eta(\theta \mid \lambda_t, \mathbf{k}_t) = e^{-\lambda_{1,t}\theta_1 - \lambda_{2,t}\theta_2}\,\theta_1^{k_{1,t}}\theta_2^{k_{2,t}}$; the posterior density is then proportional to $\eta(\theta \mid \lambda_t, \mathbf{k}_t) f_\phi(\theta)$, and the posterior expectation of any real-valued function $h(\theta)$ follows by normalizing, where $f_{Po}(k \mid \lambda) = e^{-\lambda}\lambda^k/k!$ is the probability mass function of a Poisson random variable with rate $\lambda$, and the superscript $(j)$ means that the posterior density is evaluated at the parameters estimated in the $j$-th iteration. The subscript $\theta$ of $E_{\theta,t}^{(j)}$ means that the expectation is taken with respect to $\theta$ for the $t$-th observation.
• E-step: Evaluate the $Q$ function $Q(\Theta; \Theta^{(j)})$ given the parameters estimated in the $j$-th iteration.
• M-step: After finding the $Q$ function, we update the parameters for the next iteration, $\Theta^{(j+1)}$, by computing the gradient functions $g(\cdot)$ and the Hessian matrix $H(\cdot)$ of the $Q$ function and then applying the Newton-Raphson algorithm to maximize it. The parameters can be updated separately for the Poisson part $\beta_1, \beta_2$ and the random effect part $\phi$.
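The Poisson-part Newton-Raphson update can be sketched in one dimension: given posterior means $\hat\theta_t = E^{(j)}_{\theta,t}[\theta_i]$ from the E-step, maximizing the Poisson portion of the $Q$ function is a weighted Poisson regression. This is an illustrative single-coefficient toy, not the paper's full multivariate update:

```python
import math

def newton_poisson_beta(z, k, theta_hat, beta0=0.0, tol=1e-10):
    # Newton-Raphson for one Poisson-part coefficient in the M-step:
    # maximize Q(beta) = sum_t [ k_t * z_t * beta - exp(z_t * beta) * theta_hat_t ]
    # gradient:  sum_t z_t * (k_t - exp(z_t*beta) * theta_hat_t)
    # Hessian : -sum_t z_t^2 * exp(z_t*beta) * theta_hat_t
    beta = beta0
    for _ in range(100):
        lam = [math.exp(zt * beta) * th for zt, th in zip(z, theta_hat)]
        grad = sum(zt * (kt - lt) for zt, kt, lt in zip(z, k, lam))
        hess = -sum(zt * zt * lt for zt, lt in zip(z, lam))
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta

# toy check: with all theta_hat == 1 and intercept-only design,
# the stationarity condition is sum(k) = n * exp(beta), i.e. beta = log(mean(k))
z = [1.0, 1.0, 1.0, 1.0]
k = [2, 3, 1, 2]
beta = newton_poisson_beta(z, k, [1.0] * 4)
```

In the full algorithm the same gradient/Hessian shape appears per coefficient vector $\beta_i$, with the posterior means re-evaluated at every E-step.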
-For the Poisson part, the updates follow directly from the Poisson likelihood. -For the random effect part, we need to derive the first and second order derivatives of $\log f_\phi(\theta)$ and then take the posterior expectation to construct the gradient functions and Hessian matrix. In the following, we derive the derivatives for the three mixing densities defined in the last section. Different mixing densities affect the way we calculate the posterior expectation, and in many cases we have to rely on numerical evaluation. However, some posterior expectations can be simplified to reduce the computational cost when implementing the EM algorithm in practice.
(a) Univariate Gamma density
This can be regarded as a special case because the posterior density is known in closed form: it is another univariate Gamma density with different parameters.
Specifically, the posterior of $\theta$ given $\mathbf{k}_t$ is Gamma$(\phi + k_{1,t} + k_{2,t},\ \phi + \lambda_{1,t} + \lambda_{2,t})$, so the posterior expectation needed when updating $\beta_i$ simplifies to $E^{(j)}_{\theta,t}[\theta] = (\phi^{(j)} + k_{1,t} + k_{2,t})/(\phi^{(j)} + \lambda^{(j)}_{1,t} + \lambda^{(j)}_{2,t})$. To update $\phi$, the first and second order derivatives of the Gamma log-density involve $\Psi(x)$ and $\Psi'(x)$, the digamma and trigamma functions respectively.

(b) Bivariate Lognormal density
In this case, there is no analytic expression for the posterior density. However, it can be transformed via $\epsilon = \log\theta$, so that all posterior expectations with respect to $\theta$ become expectations with respect to $\epsilon$. Under this transformation, all the posterior expectations can be evaluated by Gauss-Hermite quadrature. Furthermore, the first and second order derivatives of the Lognormal log-density involve the entries of $\mu$ and $\Sigma$, where the subscript $r$ denotes the $r$-th element of a vector and $r, s$ denotes the $r$-th row, $s$-th column entry of a matrix. Notice that all the derivatives are linear in $\epsilon_1^2, \epsilon_2^2, \epsilon_1, \epsilon_2, \epsilon_1\epsilon_2$. Hence, we can evaluate these posterior expectations once in each iteration to avoid repeated calculations.
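The Gamma-posterior simplification can be checked numerically: the sketch below (pure Python, illustrative parameter values) computes $E[\theta \mid \mathbf{k}_t]$ by brute-force integration of $\eta(\theta)f_\phi(\theta)$ and compares it to the closed-form Gamma posterior mean:

```python
import math

def gamma_pdf(x, shape, rate):
    # density of Gamma(shape, rate)
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)

def posterior_mean_numeric(phi, lam1, lam2, k1, k2, grid=20000, upper=40.0):
    # E[theta | k] via a Riemann sum: weight each theta by
    # eta(theta) * f_phi(theta) = exp(-(lam1+lam2)*theta) * theta^(k1+k2) * Gamma(phi,phi) pdf
    h = upper / grid
    num = den = 0.0
    for i in range(1, grid + 1):
        th = i * h
        w = math.exp(-(lam1 + lam2) * th) * th ** (k1 + k2) * gamma_pdf(th, phi, phi)
        num += th * w
        den += w
    return num / den

phi, lam1, lam2, k1, k2 = 1.5, 0.8, 1.2, 2, 1
# closed form: posterior is Gamma(phi + k1 + k2, phi + lam1 + lam2)
closed_form = (phi + k1 + k2) / (phi + lam1 + lam2)
```

The agreement of the two values illustrates why, in the shared-Gamma case, no numerical quadrature is needed inside the EM iterations.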

(c) Gaussian copula paired with Gamma marginals
In this case, there is no simplification either for the posterior density or for the posterior expectation. To update φ = {φ 1 , φ 2 , ρ}, we have almost the same procedure as for the bivariate Lognormal case.
where the first and second order partial derivatives are given by

Model specifications
Let $X_t$ and $R_t$ be non-negative integer-valued random vectors in $\mathbb{R}^2$. Let $P$ be a diagonal matrix in $\mathbb{R}^{2\times 2}$ with elements $p_i \in (0, 1)$. The bivariate first-order integer-valued autoregressive model (bivariate INAR(1)) is defined as $X_t = P \circ X_{t-1} + R_t$, where $\circ$ is the widely used binomial thinning operator, i.e. $p_i \circ X_{i,t-1} = \sum_{j=1}^{X_{i,t-1}} B_j$ with $B_j$ i.i.d. Bernoulli$(p_i)$, so that its conditional distribution given $X_{i,t-1}$ is binomial and can easily be written down. Note that $p_i \circ X_{i,t}$ and $p_j \circ X_{j,t}$, $i \neq j$, are independent of each other. To accommodate the heteroscedasticity arising from the data, $R_t$ follows the bivariate mixed Poisson regression model defined in the last section, such that $R_{i,t} \sim \mathrm{Pois}(\lambda_{i,t}\theta_i)$. The joint distribution of the bivariate sequence $X_{t+1}$ conditional on the last state $X_t$ is then a convolution of binomial distributions with $f_R(\mathbf{k}, t)$, the probability mass function of the bivariate mixed Poisson regression model with mixing density $f_\phi(\theta)$. Under this construction, the two components of the bivariate sequence $X_t$ are correlated with each other, and the correlation structure depends mainly on the correlation structure of the innovation $R_t$.
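A minimal simulation makes the recursion tangible. The sketch below (hypothetical parameter values; the random effects are fixed at 1, i.e. plain Poisson innovations, to keep it short) generates a bivariate INAR(1) path via binomial thinning; the stationary mean of each component is $\lambda_i/(1 - p_i)$ and the lag-one autocorrelation is $p_i$:

```python
import math
import random

def thin(x, p, rng):
    # binomial thinning p ∘ x: each of the x units survives with probability p
    return sum(1 for _ in range(x) if rng.random() < p)

def rpois(lam, rng):
    # Knuth's Poisson sampler
    L, k, q = math.exp(-lam), 0, 1.0
    while True:
        q *= rng.random()
        if q <= L:
            return k
        k += 1

def simulate_binar1(n, p, lam, rng):
    # X_t = P ∘ X_{t-1} + R_t, with independent Poisson innovations here
    x, path = [0, 0], []
    for _ in range(n):
        x = [thin(x[i], p[i], rng) + rpois(lam[i], rng) for i in (0, 1)]
        path.append(tuple(x))
    return path

rng = random.Random(7)
path = simulate_binar1(50000, (0.4, 0.6), (1.0, 0.5), rng)
mean1 = sum(s[0] for s in path) / len(path)  # near lam_1 / (1 - p_1)
```

Replacing `rpois(lam[i], rng)` with a draw from the mixed Poisson innovation of Sect. 2 yields the models studied in this paper, with cross-correlation inherited from the correlated random effects.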

The EM algorithm
Similarly, maximum likelihood estimation is not straightforward to apply, as the log-likelihood function (23) is intractable: there are still unobserved random variables $\theta$ in $R_t$, and in some of the examples discussed in the last section $f_R(\mathbf{k}, t)$ has no analytic expression, so we would like to break the likelihood function down further. Suppose further that we observe the random effects $\{\theta_t\}_{t=1,\ldots,n}$; then the complete log-likelihood can be written down, and we define the corresponding posterior density functions and the posterior expectations with respect to real-valued functions $h(\cdot, \cdot)$.
• E-step: Evaluate the $Q$ function $Q(\Theta; \Theta^{(j)})$ given the parameters estimated in the $j$-th iteration. After breaking down the log-likelihood function, it is clear that, except for the log-likelihood contributed by the binomial distribution, the remaining terms are almost the same as those of the $Q$-function of the bivariate mixed Poisson regression model discussed in the last section, which means the updating procedure for $\beta_i, \phi$ is exactly the same, but we need to evaluate different posterior expectations in this case.
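A sketch of the decomposition (assuming, as in the construction above, that $y_{i,t} := p_i \circ X_{i,t-1}$ denotes the latent thinning counts and $\theta_t$ the random effect vector of the $t$-th innovation, both treated as observed in the complete data):

```latex
\ell_c(\Theta) \;=\; \sum_{t}\sum_{i=1}^{2}
  \Big[\log f_{Bin}\big(y_{i,t}\,;\,X_{i,t-1},\,p_i\big)
     + \log f_{Po}\big(X_{i,t}-y_{i,t}\,\big|\,\lambda_{i,t}\theta_{i,t}\big)\Big]
  \;+\; \sum_{t}\log f_{\phi}(\theta_t)
```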
• M-step: Similarly, we apply the Newton-Raphson algorithm to update the parameters. Based on the structure of $Q(\Theta; \Theta^{(j)})$, the parameters can be updated separately for the binomial part $p$, the Poisson part $\beta_i$ and the random effect part $\phi$.
-The binomial part can be updated simply, as its gradient function has a unique solution. Here $y_{i,t}$ denotes the latent thinning count $p_i \circ X_{i,t-1}$, whose posterior expectation is zero whenever $X_{i,t} = 0$ or $X_{i,t-1} = 0$; see Appendix A.4 for the derivation of the posterior expectation of $y_{i,t}$. Note that when the mixing density $f_\phi(\theta)$ is univariate Gamma, the posterior expectation of $\theta$ has a simple expression.
-Similarly, for the random effect part $\phi$, the updates for (a) the univariate Gamma density and (c) the Gaussian copula paired with Gamma marginals use the partial derivatives inside the expectations derived in the last section.
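For instance, setting the gradient of the binomial part of the $Q$ function to zero yields a closed-form update, a standard result for EM algorithms in INAR(1) models (with $y_{i,t}$ the latent thinning count as above):

```latex
p_i^{(j+1)} \;=\; \frac{\sum_{t} E^{(j)}\big[\,y_{i,t}\,\big|\,X\,\big]}
                       {\sum_{t} X_{i,t-1}}
```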
Remark This model, as well as the EM algorithm, can be extended to the multivariate case straightforwardly. All the steps and the general form of the EM algorithm formulas in the multivariate case are exactly the same. The only problem is that evaluating the transition probability $P(X_t \mid X_{t-1})$ becomes cumbersome as the dimension of $X_t$ increases.

Data description and model fitting
The data used in this section come from the Local Government Property Insurance Fund (LGPIF) of the state of Wisconsin. For previous applications of this dataset, the interested reader is referred to Refs. [28-30]. The fund provides property insurance to different types of government units, including villages, cities, counties, towns and schools. The LGPIF contains three major groups of property insurance coverage, namely building and contents (BC), contractors' equipment (IM) and motor vehicles (PN, PO, CN, CO). For exploratory purposes, we focus on jointly modelling the claim frequency of IM, denoted as $X_1$, and comprehensive new vehicle collision (CN), denoted as $X_2$, as they are both related to land transport. The insurance data cover the period 2006-2010 with 1234 policyholders in total. Only $n_1 = 1048$ of them have complete data over the period 2006-2010; these form the training dataset. The year 2011, with $n_2 = 1025$ policyholders (the same set of policyholders as in the training dataset, out of 1098 policyholders in that year), forms the test dataset. Denote the IM type and CN type claim frequencies for a particular policyholder as $X^{(h)}_{1,t}, X^{(h)}_{2,t}$ respectively, where $h$ is the identifier for each policyholder and $t$ is the year. Then the relationship between $X_{i,t}$ and $X^{(h)}_{i,t}$ is simply $X_{i,t} = \sum_h X^{(h)}_{i,t}$, $i = 1, 2$. Some basic statistical analysis is shown in Table 1 and Fig. 1. The proportion of zeros for the two types of claims is over 90% throughout 2006-2010. Both types of claims show overdispersion, as the variance exceeds the mean in every year, and the overdispersion of $X_{2,t}$ is even stronger than that of $X_{1,t}$, which indicates the need for an overdispersed distributional model for the data. The correlation tests over the years imply that it is reasonable to introduce a correlation structure between $X_{1,t}$ and $X_{2,t}$.
The proportion of zeros and the kurtosis show that the marginal distributions of $X_{1,t}, X_{2,t}$ are positively skewed and fat-tailed, which indicates the appropriateness of adopting a positively skewed and fat-tailed mixing distribution (the Lognormal distribution). Last but not least, the correlation tests illustrated in Table 2 support the appropriateness of introducing a time series term in modelling the claim sequence. The description and some summary statistics for all the explanatory variables (covariates $z_{1,t}, z_{2,t}$) relevant to $X_{1,t}, X_{2,t}$ are shown in Table 3. Variables 1-5, including 'TypeVillage', are categorical variables indicating the entity type of a policyholder. Due to the strongly heavy-tailed structure of variables 6 and 9, which can drastically distort the model fitting, these variables are transformed by means of the 'rank' function in the R software and then standardized, which mitigates the effect of outliers. Variables 6-8 are relevant to the IM claims $X_{1,t}$, while variables 9 and 10 provide information for the CN claims $X_{2,t}$. The covariate $z_{1,t}$ includes variables 1-8, and $z_{2,t}$ contains variables 1-5 and variables 9 and 10. These covariates act as the regression part for $\lambda_{i,t}$ mentioned in Sect. 2 and may help to explain part of the heterogeneity within $X_{1,t}$ and $X_{2,t}$.
Due to the large computational cost of evaluating the partial derivatives in the copula case (large sample size), all the models discussed in Sects. 2 and 3 except the copula case are applied to model the joint behaviour of $X_{1,t}, X_{2,t}$ across all the policyholders; instead, a simulation study in Appendix A.5 shows that the EM algorithm does work for the copula case. (The correlation test in Table 2 is a one-sided test where the alternative hypothesis is that the sample correlation is greater than 0.) Since we would like to model the aggregate behaviour rather than the individual one, the likelihood function simply becomes $\ell(\Theta) = \sum_h \ell_h(\Theta)$, where $\ell_h(\Theta)$ is the log-likelihood function for policyholder $h$. Note that all the policyholders with the same type of claim $X_i$ share the same set of parameters $\Theta = \{p_1, p_2, \beta_1, \beta_2, \phi\}$. In addition, it is necessary to show the appropriateness of introducing cross-correlation and autocorrelation in the BINAR(1) model, so we also fit the data to the following models.

The joint distribution of X
where $\theta_1$ and $\theta_2$ are independent random variables, either Gamma or Lognormal.

The joint distribution of X
where $R_{i,t} \sim \mathrm{Pois}(\lambda_{i,t}\theta_{i,t})$, $i = 1, 2$, and the random effects $\theta_{i,t}$ are independent across $i$ and $t$.
For comparison purposes, we fit these univariate and bivariate Poisson mixture models on the training dataset starting from 2007, because they do not need to consider the lagged responses. For the initial values, we use the following. The lag-one correlation of each sequence serves as the initial value of $p_i$. We fit a Poisson generalized linear model to each sequence to obtain the initial values of $\beta_i$. Finally, we use the moment estimates of the bivariate Poisson mixture model (without regression) as initial values of $\phi = \{\phi_1, \phi_2, \rho\}$. All the estimation is performed in the R software, in which we implement the EM algorithms derived in the previous sections. The standard deviations of the estimators are calculated by inverting the observed information matrix from the incomplete log-likelihood function (the log-likelihood function without the unobserved latent variables).
Model fitting results are shown in Table 4. (For the TMP and TINAR classes in Table 4, rows and columns stand for the mixing densities of $\theta_1, \theta_2$ respectively, and the bold cells indicate the best model within the same class.) Within the same class of models, compared to the univariate Gamma mixing density, the Lognormal case allows more flexible structures to capture different distributional behaviour within the two types of claims. Hence we observe the improvement in AIC from the univariate Gamma case to the Lognormal case, and it is no surprise that the Lognormal is always the best choice within the same class of models. Among the different classes of models, it is clear that the adoption of an autocorrelation component significantly improves the model fit. Finally, the significant improvement in terms of AIC from TINAR to BINAR, as well as from TMP to BMP, indicates that it is appropriate to introduce cross-correlation between the two sequences $X_1, X_2$. The parameters estimated via the EM algorithm are shown in Table 5.

Predictive performance
In insurance claims modelling, it is more useful to check the overall distribution for all policyholders than to predict the claim frequency for each policyholder; the former can be used for premium calculation, risk management, and so forth. To evaluate the predictive performance, we calculate the predicted claim frequencies $\widehat{\mathrm{Freq}}(X_t \mid X_{t-1}, \hat\Theta)$, which are the sums of the individual probabilities $P(X_t = (i, j) \mid X_{t-1})$, $0 \le i, j \le 10$, based on the estimated parameters, and compare these to the observed frequencies from the test sample $(X_{1,2011}, X_{2,2011})$ (year 2011). In addition to our proposed BINAR model with Lognormal mixing density, we also compute the predictive performance of the best TMP, TINAR and BMP models from Table 4 as benchmarks for comparison purposes. Based on the predicted claim frequencies, one can also compute the expected number of claims marginally and measure the predictive sum of squared errors (PSSE). (In Table 5, the upper value in each entry is the estimate and the estimated standard deviations are indicated in round brackets.) All the results are summarised in Tables 6 and 7, and it is clear that our proposed model, the bivariate INAR(1), has the best predictive performance with the smallest PSSE among all the models. Furthermore, the TLL results show that the bivariate INAR outperforms all other models, which is consistent with the model fitting results in Table 4.

Application to ratemaking
In this subsection, the best fitted models from Table 4 are analysed for ratemaking. We select three representative risk profiles under the different models, named Good, Average and Bad, illustrated in Table 8. These three risk profiles are selected according to the sign and size of the coefficients in Table 5, and they have explicit formulae. Tables 9 and 10 summarise the mean and variance under the different risk profiles and different claim history structures $(X^{(h)}_{1,t-1}, X^{(h)}_{2,t-1})$. As TMP and BMP do not depend on the claim history, their means and variances are all the same within the same risk profile. It is interesting to see that the variance of the INAR models is smaller than that of the mixed Poisson models in many cases.

Concluding remarks
In this paper, we consider a new family of bivariate mixed Poisson INAR(1) regression models for modelling multiple time series of different types of claim counts. The proposed family of models accounts for bivariate overdispersion and, similarly to copula-based models, allows for interactions of different signs and magnitude among the two count response variables without using the finite differences of the copula representation, which may result in numerical instability in the ML estimation procedure. For illustrative purposes, we derived the BINAR(1)-LN and BINAR(1)-GGA regression models, which can be regarded as competitive alternatives to the BINAR(1)-GA regression model for modelling time series of count data. Furthermore, from a computational statistics standpoint, the EM type algorithms we developed for ML estimation of the parameters of all the models are easily implementable and were shown to perform well when we exemplified our approach on LGPIF data from the state of Wisconsin. At this point, it should be noted that we considered the bivariate case and the Gamma and Lognormal correlated random effects for expository purposes. Moreover, the EM estimation framework we proposed is sufficiently flexible: it can be used with other continuous mixing densities with unit mean and, unlike for copula-based models (which also allow for both positive and negative correlations), generalizations to response vectors of any dimension are straightforward. However, in the latter case, EM estimation may be computationally demanding due to algebraic intractability. Nevertheless, in such cases, due to the structure of the EM algorithm for multivariate INAR(1) models with correlated random effects, the E- and M-steps can be executed in parallel across multiple threads to exploit the processing power available in multicore machines.
Finally, an interesting topic for further research would be to also take into account cross autocorrelation, proceeding along similar lines as in Ref. [4].
A.2 Gauss-Hermite quadrature
The Gauss-Hermite quadrature rule is used to evaluate expectations of the form $E[h(X)]$, where $X \sim N(\mu, \sigma^2)$. The idea is first to make a linear transformation of the integrand and then apply the quadrature rule directly: $E[h(X)] \approx \frac{1}{\sqrt{\pi}}\sum_{i=1}^{n} w_i\, h(\mu + \sqrt{2}\,\sigma\,\xi_i)$, where $\xi_i$ are the roots of the Hermite polynomial of degree $n$ with corresponding weights $w_i$. The quadrature approximation of the integral is accurate only when the function $h$ can be well approximated by a polynomial of degree $2n - 1$ or less. These values can be obtained from the R function gauss.quad in the package statmod. The extension of this result to the high-dimensional setting is straightforward. Specifically, we first transform the density function into the form $\exp\{-y^T y\}$, where $y \in \mathbb{R}^{k\times 1}$; then the $k$-dimensional integral reduces to a $k$-fold Gauss-Hermite integral. Suppose the $k$-dimensional random vector $X \sim N(\mu, \Sigma)$, where $\mu \in \mathbb{R}^{k\times 1}$ and $\Sigma \in \mathbb{R}^{k\times k}$. Then a linear transformation for this random vector is obtained through the eigendecomposition of $\Sigma$.
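A minimal sketch of the one-dimensional rule, with the $n = 3$ nodes and weights hard-coded (roots of $H_3(x) = 8x^3 - 12x$) so that no quadrature library is needed; since the rule is exact for polynomials of degree $\le 2n - 1 = 5$, it recovers $E[X^2] = \mu^2 + \sigma^2$ exactly:

```python
import math

# Gauss-Hermite nodes/weights for n = 3 (roots of H_3(x) = 8x^3 - 12x)
nodes = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
weights = [math.sqrt(math.pi) / 6, 2 * math.sqrt(math.pi) / 3, math.sqrt(math.pi) / 6]

def gh_expectation(h, mu, sigma):
    # E[h(X)] for X ~ N(mu, sigma^2) via the substitution x = mu + sqrt(2)*sigma*xi:
    # E[h(X)] ≈ (1/sqrt(pi)) * sum_i w_i * h(mu + sqrt(2)*sigma*xi_i)
    s = sum(w * h(mu + math.sqrt(2.0) * sigma * xi) for xi, w in zip(nodes, weights))
    return s / math.sqrt(math.pi)

# E[X^2] = mu^2 + sigma^2; with mu = 1, sigma = 2 this is exactly 5
val = gh_expectation(lambda x: x * x, 1.0, 2.0)
```

In practice the paper's implementation obtains the nodes and weights from `statmod::gauss.quad` in R for larger $n$; the transformation logic is the same.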

A.5 Simulation study for Gaussian copula paired with Gamma marginals
This appendix verifies that the EM algorithms work for both the bivariate Poisson mixture model and the bivariate INAR model when the random effect is characterized by the copula. Two random samples of size 500 are generated from the two models separately with pre-specified parameters, which are listed in Table 12, where $\mu_1 = (1, 0.3, 0.5)^T$, $\mu_2 = (1, 0.2, 0.4)^T$, $D = \mathrm{diag}\{0, 1, 1\}$ and MVN stands for the multivariate normal distribution. Each model is then fitted by two methods: maximising the incomplete likelihood, and the EM algorithm. The two methods should give almost the same results for $p, \beta_1, \beta_2$, which determine the mean of the model and hence make a relatively large contribution to the likelihood. On the other hand, this may not be the case for the other parameters $\phi_1, \phi_2, \rho$, which determine the variation and correlation of the model and contribute a relatively small part of the likelihood. The final log-likelihood values would normally be larger than the log-likelihood evaluated at the pre-specified parameters. The estimation results are given in Table 13. The differences between the estimated parameters $\phi_1, \phi_2, \rho$ and their true values appear larger than for the others. This is reasonable because these parameters control the variation of the distribution, and the log-likelihood is less sensitive to them.

Table 13 Simulation study for the two estimation methods, where OPTIM is the optimization function implemented in R with conjugate gradient (CG) as the method