1 Introduction

Bivariate count data regression models are appropriate in situations where paired counts are correlated and joint estimation is required. In this paper, in the application section, an insurance claim count dataset is used to model the number of claims for two different types of claim.

The starting point in the literature for modeling bivariate count data is the bivariate Poisson regression model. We denote this bivariate Poisson distribution as \(BP(\lambda _1,\lambda _2,\lambda _3)\), where \(\lambda _1 + \lambda _3\) and \(\lambda _2 + \lambda _3\) are the marginal means and \(\lambda _3\) is the covariance parameter. Based on this distribution, there are many applications in economics and related fields. However, this distribution has two limitations: It does not allow for overdispersion (marginal distributions follow a Poisson distribution and thus the conditional mean of each count variable equals the conditional variance) or a negative correlation (the model only accommodates for nonnegative correlation between the counts).

As in the univariate case, when using mixed Poisson models, bivariate count models can be generalized in different ways to allow for overdispersion. Mixtures of bivariate Poisson distribution can be considered in at least two different ways. In the first, we start with a \(BP(a \lambda _1, a\lambda _2, a \lambda _3)\) distribution where a follows some distribution. Munkin and Trivedi (1999) described multivariate mixed Poisson regression models based on this type of mixing and a gamma mixing distribution. Gurmu and Elder (2000) used an extended gamma density as a mixing distribution. Ghitany et al. (2012) proposed a general EM algorithm to facilitate estimation for the multivariate negative binomial, Poisson inverse Gaussian and Poisson lognormal regression models. The second way starts with a \(BP(a_1 \lambda _1, a_2 \lambda _2, a_3 \lambda _3)\) distribution, but now assuming that the vector \((a_1,a_2,a_3)^T\) follows a trivariate (or bivariate if we assume that \(\lambda _3=0\)) distribution. The case when \(\lambda _{3}=0\) has received attention primarily because of its simplicity. Steyn (1976) proposed the use of a bivariate normal distribution as the mixing distribution. Later on, Aitchison and Ho (1989) proposed using the bivariate lognormal distribution instead of the simple bivariate normal distribution.

In both of the above models, the specification of the random-effects distribution is continuous although it may be a discrete or a finite distribution. Karlis and Meligkotsidou (2007) considered the latter case assuming that the joint distribution for the random effects is a finite distribution, i.e., the case in which only a finite number of points have positive probabilities. Such an assumption gives rise to a finite mixture of multivariate Poisson distributions. In Bermúdez and Karlis (2012), a finite mixture of bivariate Poisson regression models was proposed to allow for overdispersion. The novelty of this approach lies in the fact that it is assumed that the data consist of subpopulations with different regression structures. A potential use of such a model is for examining the clustering of observations. The model can also be used to model negative correlation and embraces zero-inflated bivariate Poisson regression models.

In parallel, there are other models that are designed to allow for negative correlations. A bivariate Poisson distribution that allows for negative, zero or positive correlations was introduced by Lakshminarayana et al. (1999) with a distribution defined as the product of Poisson marginals with a multiplicative factor parameter. Extension to other marginal are also possible (Famoye 2010; Zamani et al. 2016). Berkhout and Plug (2004) derived an alternative count data model using conditional probabilities. More recently, Genest et al. (2018) proposed an alternative model extending the trivariate reduction to achieve negative correlation.

A different approach involves modeling dependence through copula functions to allow for greater flexibility for the dependence structure. As an example, Cameron et al. (2004) defined bivariate count distributions from copula functions. Shi and Valdez (2014) proposed different multivariate negative binomial regression models using copulas to model insurance claim counts.

Finally, it is worth mentioning that count data often display more zero outcomes than would be expected with the Poisson regression model. Indeed, the zero-inflation phenomenon is a very specific type of overdispersion, and zero-inflated Poisson regression models have been used to handle zero-inflated count data (Li et al. 1999; Wang et al. 2003).

The present paper proposes copula-based bivariate finite mixture regression models, extending the knowledge gained from the papers above, particularly from Bermúdez and Karlis (2012) and Shi and Valdez (2014). First, using a finite mixture, the proposed model allows for overdispersion (embracing zero-inflation) and, simultaneously, for an elegant interpretation based on the typical clustering application of finite mixture models. Second, by using copula functions, it allows for a more flexible dependence structure, including negative correlations.

Finite mixtures with copulas were introduced in Kosmidis and Karlis (2016), and the current paper follows their main ideas. Here, for each cluster, a bivariate distribution defined through copulas is used, extending the previous reference by allowing for regressors as well. In particular, a bivariate distribution is used for each cluster, with marginals defined as having any discrete distribution where their means are related to covariates while the two marginals are joined via some copula function.

This new approach, using copulas, offers a series of improvements such as allowing a bivariate regression to be used for each cluster that can also have a negative correlation, allowing for a wide range of marginal distributions that means not necessarily restricting ourselves to simple bivariate Poisson distributions, and allowing for different dependence structures via the selection of copulas. In parallel, using a finite mixture, this approach allows for cluster analysis, and hence, for a better interpretation of count data analysis.

We describe our approach in Sect. 2 and provide an estimation with an EM algorithm in Sect. 3. In order to show the aforementioned improvements, a real data application is given in Sect. 4. Finally, Sect. 5 presents our concluding remarks.

2 The proposed model

We base our model on copulas. Copulas are quite common for continuous data but not widely used for discrete data (see, e.g., Nikoloulopoulos 2013, for a discussion of the challenges in applying copulas to multivariate counts). In the present paper, we follow the approach in Kosmidis and Karlis (2016) where the multivariate distribution in the finite mixture is constructed via copulas. As a novelty, we assume a finite mixture of such distributions but also allowing for covariates in order to consider heterogeneity. Our approach generalizes the work in Bermúdez and Karlis (2012) where mixtures of bivariate Poisson regressions were assumed by considering a less flexible dependence structure. In order to expand the applicability of this model, our new approach can allow for different marginal distributions for each type of claim and different dependence structures via the selection of copulas.

2.1 Copulas

A copula is a function \(C(\cdot )\) that can “couple” the marginals to model the dependence structure. A p-dimensional copula \(C(u_1, \ldots , u_p)\) is a distribution function with uniform marginals. The importance of copulas in statistical modeling stems from Sklar’s theorem (see Nelsen 1997, §2.3), which shows that every multivariate distribution can be represented via the choice of an appropriate copula and, more importantly, that a copula provides a general mechanism for constructing new multivariate models in a straightforward manner. Note that for continuous distributions the copula is unique, but for the discrete case the copula is not unique, which is generally not a problem in applications (see the discussion in Genest and Nešlehová (2007)).

For instance, consider the probability mass function \(P(x;\mu _k)\), where \(\mu _k\) denotes the parameter(s) of each distribution, for \(k=1,\ldots ,p\). The cumulative distribution function (cdf) is given by

$$\begin{aligned} F(x;\mu _k)=\sum _{m=0}^x P(m;\mu _k), \quad k=1,\ldots ,p . \end{aligned}$$

We can then couple the marginals to add a dependence structure. For example, we can have a bivariate distribution with marginals \(F(x_1;\mu _1)\) and \(F(x_2;\mu _2)\), given by

$$\begin{aligned} F(x_1,x_2;\mu _1,\mu _2,\alpha )\equiv C(F(x_1;\mu _1),F(x_2;\mu _2);\alpha ), \end{aligned}$$
(1)

where \(\alpha \) now denotes the copula parameter(s) that explain(s) the association between the two random variables. This is a well-defined distribution function with a dependence structure depending on the chosen copula. However, to describe the joint probability mass function, we need to calculate finite differences from the cdf shown above. For the general case, when working with copulas for discrete data, the probability mass function is given in expression (1.2) in Panagiotelis et al. (2012), and results from finite differences in the distribution function. In its general form, it is written as

$$\begin{aligned} P(\mathbf{x}) = \sum _m \mathrm{sgn}(\mathbf{m}) C(F_1(m_1), \ldots , F_p(m_p)), \end{aligned}$$
(2)

with \(\mathbf{m} = (m_1, \ldots , m_p)\) vertices, where each \(m_t\) is equal to either \(x_t\) or \(x_t - 1\) \((t = 1, \ldots , p)\), and

$$\begin{aligned}\mathrm{sgn}(\mathbf{m}) = \left\{ \begin{array}{rc} 1\,, &{} \text {if } m_t = x_t - 1 \text { for an even number of } t\text {'s} \\ -1\,, &{} \text {if } m_t = x_t - 1 \text { for an odd number of } t\text {'s} \\ \end{array} \right. \, . \end{aligned}$$

For example, for the bivariate case we have

$$\begin{aligned} P(X_1=x_1,X_2=x_2;\mu _1,\mu _2,\alpha )&= F(x_1,x_2;\mu _1,\mu _2,\alpha ) \nonumber \\&\quad - F(x_1-1,x_2;\mu _1,\mu _2,\alpha ) \nonumber \\&\quad -F(x_1,x_2-1 ;\mu _1,\mu _2,\alpha ) \nonumber \\&\quad + F(x_1-1,x_2-1;\mu _1,\mu _2,\alpha ) \end{aligned}$$
(3)

where \(\mu _1,\mu _2\) refer to marginal parameters and \(\alpha \) to the copula parameter(s). In practice, since copulas are distribution functions, this is a bivariate integral calculated in some rectangular bivariate space. Depending on the choice of copula, this can be simplified to avoid the need for integration.

2.2 The finite mixture model

Consider the general case with p variables. We want to model the vector \(\mathbf{X} = (X_1, \ldots ,X_p)^T\). The probability mass function of a finite mixture model is defined as

$$\begin{aligned} P(\mathbf{x}; {\varvec{\theta }}, {\varvec{\pi }}) = \sum _{j = 1}^G \pi _j P_j(\mathbf{x}; {\varvec{\theta }}_j) \, , \end{aligned}$$
(4)

where \({\varvec{\theta }} = ({\varvec{\theta }}_1^T, \ldots , {\varvec{\theta }}_G^T)^T \in \Theta _1 \times \ldots \times \Theta _G\), and \(\pi _j \in (0,1)\) with \(\sum _{j = 1}^G \pi _j = 1\). Appropriate choices of \(P_j(\mathbf{x}; {\varvec{\theta }}_j)\) can result in flexible and low complexity models. The book of McNicholas (2016) provide a detailed treatment of the framework of mixture modeling.

We emphasize that \({\varvec{\theta }}_j\) is a vector containing all the parameters associated with the j-th component. Hence, it contains both the parameters from the marginals and the association parameters of the copula.

Our novel approach assumes that \(P_j(\mathbf{x}; {\varvec{\theta }}_j)\) is a bivariate distribution defined through a copula. The probability mass function is calculated using (3) with marginal distributions that are some discrete distribution with a mean \(\lambda _{ijk}\) for the i-th observation, j-th component and k-th variable, not necessarily from the same family. We assume that \(i=1,\ldots ,n\) (i.e. we have n observations), \(j=1,\ldots ,G\) (i.e., G components) and \(k=1,2\) for the two variables at hand. In addition, we add covariates in the mean of the k-th marginal distribution and hence

$$\begin{aligned} \log \lambda _{ijk} = \beta _{jk}^T {\varvec{\omega }}_{ik}, \end{aligned}$$

where \(\beta _{jk}\) is a vector of regression coefficients. Also note that the covariates in \({\varvec{\omega }}_{ik}\) refer to the k-th variable and they maybe different for different \(\lambda \)’s (variables); for simplicity here we assume that they are the same and drop the second subscript. Note that the marginal distributions may depend on some other parameter; for example, in the case of a negative binomial marginal, we can have an overdispersion parameter \(s_{jk}\) for the j-th component of the k-th variable.

Note that one may also consider covariates on the mixing proportion vector \(\mathbf \pi = (\pi _1,\ldots ,\pi _{G-1})\) based on some multinomial regression models. We do not pursue this on the present paper.

Hence, now we can further define the vector of component specific parameters as \({\varvec{\theta }}_j = ( \beta _{j1}, \beta _{j2}, \alpha _j, \phi _{j1}, \phi _{j2} )\) where \(\beta _{jk}\), \(k=1,2\), are the regression coefficient vectors for the j-th component, \(\alpha _j\) is the copula parameter for the j-th component, and \(\phi _j\)’s relate to other parameters associated with the j-th component and each and depend on the choice of the marginal distributions. They may differ if the marginals assumed are not the same.

Finally, we need to define the copula function to couple the marginal distributions. In the present paper, we use three different copulas but we emphasize that the literature contains several other choices that reflect different types of association between the marginals. First, we use a Frank copula with a cdf given by

$$\begin{aligned} C_\alpha (u,v) = -\frac{1}{\alpha } \left( 1 + \frac{(\exp (-u \alpha )-1)(\exp (-v \alpha )-1)}{(\exp (-\alpha )-1)} \right) , \end{aligned}$$
(5)

where \(0 \le u,v\le 1\), \(\alpha \in (-\infty ,\infty ) - \{ 0 \}\). In fact, \(\alpha \) is the copula parameter, and being negative implies negative correlation. Second, we also use a Gaussian copula defined as

$$\begin{aligned}C_{\alpha }(u,v)=\Phi _{\alpha }\left( \Phi ^{-1}(u),\Phi ^{-1}(v)\right) ,\end{aligned}$$

where \(\Phi \) is the N(0,1) cdf, \(\Phi ^{-1}\) is the functional inverse of \(\Phi \) and \(\Phi _{\alpha }\) is the bivariate standard normal cdf with correlation \(\alpha \). Finally, we use a Gumbel copula defined as

$$\begin{aligned} C_\alpha (u,v) = \left[ \left( {{\tilde{u}}}^\alpha + {{\tilde{v}}}^\alpha \right) ^{1/\alpha } \right] , \end{aligned}$$
(6)

with \({{\tilde{u}}} = -\log (u)\) and \({{\tilde{u}}} = -\log (v)\).

So, for example, for the bivariate Poisson based on the Frank copula we need to evaluate (3) using (5) and the cdf of Poisson distributions for F.

We emphasize that the choice of marginal distributions is a separate task from the dependence modeling and hence one may choose different marginals that offer considerable flexibility. Also the copulas need not be the same for each component of the mixture. For a thorough discussion about such a model, see the discussion in Kosmidis and Karlis (2016).

3 ML estimation via an EM algorithm

In this section, we describe ML estimation of the model. Being a finite mixture of regressions type model, estimation is based on an EM algorithm. The additional complexity here is the use of a copula function. As in Kosmidis and Karlis (2016) we can still apply the EM algorithm with limited additional complexity. The algorithm is as follows.

As usual we assume as unobserved data the vectors \(\mathbf{Z}_i=(Z_{i1}, \ldots , Z_{iG})\) with values \(Z_{ij}=1\) if the i-th observation belongs to the j-th component and 0 otherwise. The EM proceeds as usual, by estimating the unobserved component memberships at the E-step and maximizing the complete likelihood at the M-step. So the E-step is the typical step that estimate the component memberships. It turns out (see Kosmidis and Karlis 2016) that the M-step consists of G weighted likelihood steps. Formally, the algorithm is as follows.

Suppose that a sample of n p-vectors \({\varvec{x}}_1, \ldots , {\varvec{x}}_n\) is available, which are assumed to be realizations of independent random variables \({\varvec{X}}_1, \ldots , {\varvec{X}}_n\) each distributed with a probability mass function as defined by (4) and (3). The maximization of the likelihood function based on that sample can be performed using the expectation–maximization algorithm. At the \(\ell \)-th iteration of the algorithm \((\ell = 2, \ldots )\),

  • E-step: Calculate

    $$\begin{aligned} w_{ij}^{(\ell + 1)} = \frac{\pi _j^{(\ell )} P_j(\mathbf{x}_i; {\varvec{\theta }}_j^{(\ell )})}{ \sum _{j = 1}^G \pi _j^{(\ell )} P_j(\mathbf{x}_i; {\varvec{\theta }_j}^{(\ell )})} \quad i = 1, \ldots , n; j = 1, \ldots , G \,.\end{aligned}$$
  • M-step 1: Set

    $$\begin{aligned}\pi _j^{(\ell + 1)} = \frac{\sum _{i = 1}^n w_{ij}^{(\ell + 1)}}{n} ~~j = 1, \ldots , G\,. \end{aligned}$$
  • M-step 2: Maximize

    $$\begin{aligned} l({\varvec{\theta }}; {\varvec{x}}_1, \ldots , {\varvec{x}}_n) = \sum _{j = 1}^G \sum _{i = 1}^n w_{ij}^{(\ell + 1)} \log \left\{ P_j(\mathbf{x}_i; {\varvec{\theta }}_j) \right\} \, ,\end{aligned}$$

    with respect to \({\varvec{\theta }}\) to get an updated value \({\varvec{\theta }}^{(\ell + 1)}\) for the copula and marginal parameters. Note that this step implies that we have to maximize G weighted likelihoods, one for each component, using w’s as weights. Hence the complexity and the difficulty of this step is equivalent to fitting a bivariate copula-based regression model for the chosen marginal distributions. Also since we maximize in parallel G weighted likelihoods, efficient evaluations by parallel computing are feasible.

In the above \(P_j(\cdot )\) can be any bivariate discrete distribution suitable to model bivariate count data. In our case, we use bivariate Poisson and negative binomial distributions defined through copulas. Note that to allow for covariates we fit negative binomial distributions with mean \(\lambda \) and variance \(\lambda + \lambda ^2/ s\), which implies that s is the overdispersion parameter. For \(s \rightarrow \infty \), we get the Poisson model as special case, so in some sense the Poisson model is included. An interesting use of copulas is that the marginals may be of different type. For example, one may assume a Poisson distributions for the first marginal and a negative binomial for the second one.

The algorithm iterates between the E-step and the M-step until some convergence (or lack of progress) criterion is satisfied. For the example in the current paper the terminating criterion that is used is that the relative increase of the log-likelihood in two successive iterations is less than a small positive real \(\epsilon = 10^{-8}\).

Table 1 Cross-tabulation of data

Initial values can be set as described in Kosmidis and Karlis (2016) but also in more heuristic ways, such as by fitting models sequentially. See also Papastamoulis et al. (2016) for a relevant model.

4 Application

4.1 The data

We used the same automobile insurance claims dataset as used in Bermúdez and Karlis (2012). The data refer to a sample from an automobile portfolio of a major insurance company operating in Spain in which only cars categorized as being for private use were considered. For our purpose here, we selected policyholders with full coverage, i.e., policies including third-party liability (claimed and counted as \(N_1\) type), a set of basic guarantees such as emergency roadside assistance or legal and medical assistance (claimed and counted as \(N_2\) type) and, finally, comprehensive coverage (damage to one’s vehicle caused by any unknown party, for example, damage resulting from theft, flood or fire) and collision coverage (damage resulting from a collision with another vehicle or object when the policyholder is at fault), also claimed and counted as \(N_2\) type. The cross-tabulation for the number of claims for third-party liability (\(N_1\)) and number of claims for the rest of the guarantees (\(N_2\)) is shown in Table 1.

The data contain information from 28,590 policyholders. For each policy, the initial information and the total number of claims (for the two types of claim) from policyholders at fault were reported on an annual basis. Nine exogenous variables were considered for ratemaking purposes. The exogenous variables are described in Table 2. For the vectors \((N_{i1},N_{i2})\), we assume the finite mixture model defined through copulas that include covariates.

Table 2 Explanatory variables used in the models

4.2 Results

We fitted several different models ranging from simple Poisson regression without dependence to different copula-based finite mixture regression models. Table 3 presents a summary of the models of increasing complexity together with their AIC and BIC values. We started from two independent Poisson regressions (Double Poisson) and two independent negative binomial regressions (Double Neg. Bin.), and then we added some covariance between the two types of claim (bivariate Poisson and bivariate negative binomial). The improvement in the loglikelihood was large when we added zero inflation for the (0, 0) cell (zero-inflated bivariate Poisson). To account for overdispersion, caused by unobserved heterogeneity as discussed in Bermúdez and Karlis (2012) and here, later on, when discussing clustering effects, the next model was the 2-finite mixture bivariate Poisson model which also improved our fit. Finally, following our copula-based approach and a Frank copula, we obtained the copula-based finite mixture of a bivariate Poisson and a negative binomial. The improvement from the Poisson marginals to the negative binomial marginals was limited but indicated that the additional effort and complexity of a negative binomial may be needed. Two additional copula-based finite mixtures of negative binomials, coupled with Gumbel and Gaussian copulas, were fitted without obtaining improvement with respect to the Frank copula model.

In sum, copula-based models provide the best AIC/BIC, indicating that they can capture the features of the data more accurately. Finite mixture models with more than two components (\(j>2\)) were also fitted, but no improvement in terms of AIC or BIC was achieved, demonstrating that for the data at hand two components are sufficient and that the portfolio is comprised of two groups of policyholders. Note also that for all models all the available covariates were used, and no variable selection was carried out, i.e., the presented model is the one with all available information. We would like to point out that we have used all variable in all components and marginal distributions. Hence the significance need not to be judged from the standard errors along, as we shall see in Table 4.

Table 3 Various models fitted to the data

From Table 3, we can see that the best fit was obtained with the copula-based 2-finite mixture model with negative binomial marginals coupled with a Frank copula. For negative binomial components, we assumed a parametrization that leads to a dispersion parameter s. However the improvement over the same model with Poisson marginals was rather small. Table 4 shows the estimated regression coefficients for the negative binomial model. If we look at the parameter s, we note that for the second component and the first marginal, the estimated value is too large, implying that the negative binomial tends to a Poisson distribution. Note that the construction of our model through copulas allows the use of different marginal distributions.

The copula-based models also allow a more flexible structure (i.e., negative correlation for the Poisson distribution). For the present application, we used different copulas. The Frank copula allows for both positive and negative dependence and is computationally easier since it does not involve integrals. We also ran Gaussian copulas which also allow both positive and negative dependence at the cost of computational burden as they evaluate the bivariate normal integral many times, as well as a Gumbel copula to allow and check for tail dependence. The Frank copula provided the best results with the lowest computational cost.

Table 4 shows the results obtained by fitting the 2-finite mixture of bivariate negative binomial marginals coupled with the Frank copula. Coefficient regressors and their standard errors are provided for each component and type of claim. The standard errors reported were obtained from the Hessian matrix. The dispersion parameters (s) of the negative binomial marginals, mixing proportion (\(\pi \)) and copula parameter (\(\alpha \)) of each component are also shown. The last column refers to the p-value for a likelihood ratio test for testing whether the entire variable can be removed from the model. Such a test checks whether the variables has any merit for the model or not.

Table 4 Estimated coefficients and standard errors (* indicates a Z-score between −2 and 2) from the negative binomial model with the Frank copula. The last column refers to the p-value for a Likelihood Ratio Test for testing whether the entire variable can be removed from the model

In parallel, Fig. 1 shows the estimated means for this model for all the clients for the two components. Clients were assigned to groups based on their maximum a posteriori probability, i.e., using the \(w_{ij}\)’s and assigning to the group with larger probability for each client i. One can see that the two components are clearly separated. The group separation is characterized by a low mean with low variance (hereinafter policyholders considered as “good" drivers) and a high mean with high variance (policyholders considered as “bad" drivers).

Fig. 1
figure 1

Scatterplot of the fitted components

Focusing on the mixing proportion, one can see that most policyholders belong to the second component (that is, most of them are “good" drivers). Without going back over aspects related to dispersion parameters that have already been discussed, we would like to point out that the negative binomial model allows for larger variance within each component. This is appropriate for the first component (“bad drivers") which has large variance for the two types of claim. However, for the second component (“good drivers"), especially for the \(N_1\) claims, the overdispersion parameter of the negative binomial was too large, implying a Poisson marginal. Finally, from the copula parameter, we can see that the first component (“bad drivers") shows a lower correlation between type of claims than the second component (“good drivers").

It is worth giving careful consideration to this separation between “good" and “bad" drivers. In the context of car insurance, the Poisson regression model is usually rejected because of the presence of overdispersion and an excess of zeros. According to literature, these features of the data may be caused by differences in driving habits and behavior among policyholders that cannot be observed or measured a priori by the insurer. This phenomena is known as the problem of unobserved heterogeneity. For example, at the moment of signing a policy, the insurer ignores the driving ability, driving aggressiveness or the degree of obeying traffic regulations of a new client. Through the use of finite mixtures of regression models, we intend to address the problem of unobserved heterogeneity assuming that group separation can be explained by these hidden driving characteristics. In this case, we may consider those policyholders with a safer driving behavior to belong to the second component and, hence, may be considered as a “good” drivers, while policyholders with a more dangerous driving behavior can be considered as a “bad” drivers and belong to the first component.

A common assumption in all the models used in the literature to address the problem of unobserved heterogeneity (compound frequency models and their zero-inflated or hurdle versions) is that all policyholders behave in the same way with regard to the observed (or a priori) variables, and thus they all have the same regression structure. The models proposed in this paper account for unobserved heterogeneity by choosing a finite number of subpopulations and allowing each component in the discrete mixture to have its own score, i.e., each group of policyholders can have a different regression structure with regard to the observed variables. To sum up, by applying finite mixtures of regression models, we aim to examine whether unobserved risk variables, such as a driver’s reflexes, aggressiveness, or compliance with Highway Code, establish the existence of subpopulations of policyholders with different a priori behavior.

As discussed above, here we may assume the existence of two groups of policyholders, “good" and “bad" drivers, defined by their respective driving behavior. From Table 4, one can also see that each group of policyholders exhibits a different regression structure. This is particularly noticeable for the variable ZON2 with regard to the number of claims for the rest of guarantees (\(N_2\)). For policyholders considered to be “bad” drivers, driving in a higher risk zone (basically due to climate conditions) increases the probability of making a claim; whereas it decreases for “good” drivers. This is reasonable if we focus on their driving habits: “bad” drivers are more likely to make a claim in a zone with bad climate conditions because they drive aggressively and ignoring traffic conditions; while “good” drivers are even more careful with such a climate conditions and so they are less likely to have a claim. It is worth emphasizing that using simpler models, i.e., with one component, this covariate may not be significant as they only estimate one coefficient and the effect is canceled out, estimating an average effect that is close to zero.

Figure 2 shows the average posterior probabilities conditional on observed values of the two claims. Namely, each cell refers to a pair of the type \((N_1,N_2)\), \(N_1,N_2=0,1,2,3,4\) and shows the posterior probability that such an observation belongs to the group 1 or 2 averaged over the different observations with such values. One can see that, as expected, the second component (“good drivers") is more likely for those cells with less claims for both types, while the 1st component (“bad drivers") corresponds to those cells with a growing number of claims. Indeed, we may observe that “good drivers" are more likely to have a claim for the rest of guarantees than having a third liability claim. As a classification rule observation, policyholders with very few counts (specially for third liability claims) are more probable to come from second component, while for large counts we are almost certain that they belong to the first component.

Fig. 2
figure 2

Posterior probabilities conditional on observed values of the two claims. Rows refer to \(N_1\) and columns to \(N_2\). Each cell refers to a pair of the type \((N_1,N_2)\), \(N_1,N_2=0,1,2,3,4\) and shows the posterior probability that such an observation belongs to the component (group) 1 or 2, “bad" and “good" drivers respectively

4.3 Prediction

In order to see the predictive ability of the model, but also to discuss about the added value offered by the proposed models, we have run a prediction experiment with the data. Namely, we kept randomly 20% of the data as a test set, then we fitted models to the rest (80%) and we predicted the ones in the test set based on the models derived. This firstly can show the predictive ability of the model out-of-sample but also reveals the different features of the data. In Table 5, we present results for some of the models used in Table 3. Particularly, we have fitted a double Poisson model and a double negative binomial model as baseline models; the second one captures the overdispersion we have seen in the data. Then, we fitted a 2-finite mixture of bivariate Poisson distributions, this can recover part of the correlation but also the overdispersion. Finally, from those models based on copulas we selected one with a Frank copula and negative binomial marginals.

Table 5 shows the observed frequencies and the expected frequencies for the models. It is obvious that the double Poisson fails a lot, the double negative binomial corrects this in a certain extend but it cannot capture all the features of the data. Note that the marginals are closer to the observed ones but the model fails a lot for the bivariate distribution. The 2-finite mixture of bivariate enhances a lot and accounts for the overdispersion but it is still far from the observed ones. Finally, the copula model with negative binomial marginals is closer to the observations. Overall the copula-based model predicts quite well the observed data. A further important aspect is that while the observed correlation for the test set is 0.15, the double Poisson and the double negative binomial, as expected, predict a value almost zero, while the 2-finite mixture of bivariate Poisson overestimates the correlation giving a number close to 0.22. The copula model is closer from all other models with a correlation equal to 0.16. Summarizing, the copula-based model predicts quite well the cases because it can provide characteristics of the observed data closer to the real ones due to the inherent flexibility that it offers.

Table 5 Observed and predicted frequencies for out-of-sample predictions for the 20% of the observations (\(n=5,718\)) for a series of models

5 Concluding remarks

We have presented a new approach for modeling bivariate claim data based on a finite mixture of bivariate discrete distributions defined through a copula, also allowing for covariates. In our opinion, this approach offers several advantages. On the one hand, allowing for a more flexible dependence structure we can obtain a better fit and a better predictive capacity. On the other hand, through the use of finite mixture, the model further allows for a clustering analysis that tries to address the problem of unobserved heterogeneity.

The extension of this approach to higher dimensions is more demanding. The problem is that not all the copulas can be extended in a rigorous way to higher dimensions, at least not in such a way as to allow for a fully flexible dependence structure. For example, in the case of the Frank copula, while it is possible to expand it to multidimensional cases, it may involve only one common dependence parameter, which is more restrictive. The Gaussian copula is more suitable for such generalization at the additional computational cost required to evaluate many multidimensional integrals. A possible solution would be to create an approximate pairwise model and to estimate the parameters with some other method like composite likelihood. Such an approach is being considered in ongoing research.

In terms of the actuarial implications, the above model may allow premiums to be extracted in the same way as in Bermúdez and Karlis (2012) which would help to account correctly for the dependencies between the claims. Moreover, there might be a substantive interest in classifying new clients into “good" and “bad" risks and, hence, estimating premiums according to their specific regression structure derived from the finite mixture. However, current approaches are limited by the fact that it is rather impossible to classify new clients. This may be the goal for future research. Finally, note that such an approach, while applied to count claims in this paper, can also be extended to the severities of different types of claim, by considering bivariate models with copulas and continuous marginals.

A final remark relates to identifiability issues about the model. Issues related to identifiability of finite mixtures defined with copulas have been discussed in Kosmidis and Karlis (2016). In general, it is very hard to derive theoretical results due to the complicated nature of the copulas. A natural assumption is to enforce some lexicographical ordering in the components so as to ensure that we avoid the case when interchanging the order of the components we end up with the same mixture. An additional identifiability issue in our case relates to the identifiability of mixtures of GLM regression models. Conditions for identifiability for such finite mixtures of regression and GLM models are given in Hennig (2000) and Grün and Leisch (2008). Since theoretical results are hard to establish in general, from a practical perspective, we have worked with several initial values to examine whether we became trapped with different solutions. A lexicographical ordering of the components has been used. We have not found solutions that differ substantially on the regression coefficients giving similar likelihood, adding to our belief that our model is identifiable.