1 Introduction

Insurance companies usually apply generalized linear models (GLMs) to predict insurance claim counts due to the interpretability of these models. Pricing actuaries continually refine GLMs through a sophisticated choice of variable interactions. This process is time-consuming, relies heavily on expert judgement, and is based on visual performance indicators. These aspects motivate the use of machine learning (ML) techniques to improve the performance of GLMs by finding the next-best interaction to be added to the GLM. Automating this manual and mainly visual process of fine-tuning GLMs could save actuaries considerable time, especially for big data sets with dozens of variables, e.g., in motor third-party liability (MTPL) insurance.

In this paper, we propose a methodology for detecting the next-best interaction that is missing in a benchmark GLM. We aim to improve an arbitrary but fixed existing benchmark GLM instead of creating a new GLM from scratch. Building a new GLM may necessitate drastic changes in the tariff of the MTPL insurance, and large changes in tariffs are not desired by insurance companies for their existing business lines. Instead, GLMs need to be improved gradually.

The approach we suggest has three steps. First, a combined actuarial neural network (CANN) is trained. This model was introduced in the actuarial context in [22] and can be seen as a combination of a benchmark-GLM predictor and a neural-network predictor into a single neural network (NN) using a skip connection for the GLM component. Second, the strength of each pairwise interaction learned by the CANN model is quantified and the interactions are ranked by their strength using a neural interaction detection (NID) algorithm. This algorithm was introduced in [19] for fully connected feed-forward neural networks, and we adapt it to CANN models. Third, the top-ranked interactions are analyzed with the help of mini-GLMs and the next-best interaction to be included in the benchmark GLM is identified.

We show the performance of our approach on an artificially created data set, where the true interactions are known from the data-generation mechanism, and on an open-source French MTPL data set, which has been analyzed in many academic sources, e.g., [12, 16, 3, 8, 23]. Finally, we comment on the advantages of our methodology for big MTPL data sets with millions of observations and dozens of features, which are common for larger insurance companies.

Literature overview: GLMs were introduced by [11] as a generalization of a linear regression model with a normally distributed response variable. Since then, GLMs have become an important and popular tool in insurance pricing, especially in MTPL insurance. For more information, see [2, 14, 23].

The process of finding the best GLM becomes very challenging as the number of variables increases. Since it is not possible to fit and compare all possible models for a large number of variables, three classical approaches have been developed: forward selection, backward selection, and mixed (bi-directional) selection. Forward selection is a greedy approach and might include variables early that later become redundant; mixed selection can address this issue. As an alternative to stepwise variable selection, one can use penalized likelihood estimation, e.g., the Least Absolute Shrinkage and Selection Operator (LASSO) introduced by [18], to find the best subset of variables for a GLM.

However, it is even more computationally challenging to use the above methods to search for the best GLM with interacting variables: the number of possible interactions is very large, and re-fitting even a single GLM on a real-world big data set is time-consuming. Therefore, researchers have explored the usage of neural-network-based models for predicting claim frequencies and learning from them which interacting variables can be added to a GLM to improve its performance.

First, a neural-network based model is trained. Second, interaction-detection methods are applied. Two types of interaction-detection methods are distinguished—model-agnostic and model-specific ones. Model-agnostic methods do not use a specific structure of the ML model. This class of methods includes Friedman’s H-statistics [4], Greenwell statistics [6], feature interaction in terms of prediction performance [13], and SHAP interaction values [9]. The main drawback of the above-mentioned model-agnostic methods is their high computational cost for big actuarial data sets. Model-specific interaction detection methods rely on the peculiarities of the ML model under consideration. For example, [21] proposes a procedure for determining missing interactions in the benchmark GLM via CANNs. For each interaction of interest, a CANN that uses the interaction of interest and the prediction of the benchmark GLM is trained. If the deviance loss function of the CANN model decreases significantly in comparison to the deviance loss of the benchmark GLM, then this interaction is considered missing. [15] proposes a LocalGLMnet, which retains the additive decomposition of the response variable as in the case of a GLM, but lets the regression coefficients become feature-dependent. Once a LocalGLMnet is trained, one can detect whether there is an interaction between two features by exploring smoothed plots of gradients of the regression coefficients, also called regression attention due to their dependence on the features. How to optimally determine interactions for categorical variables with many levels is still an open question for LocalGLMnets.

In our approach, we also use a CANN, as in [16]. We use embedding layers for categorical features with many categories, as this is shown to improve predictive performance on actuarial data sets, see [16, 21]. To extract interactions, we use a model-specific method that is a modification of Neural Interaction Detection (NID), developed in [19]. It computes the strengths of all interactions among the input neurons of a neural network very quickly, since it uses only the trained weights of the network, and it can be extended to embedding layers and to the architecture of a CANN model.

Structure: In Sect. 2, we explain the basics of GLMs. In Sect. 3, we describe in detail our proposed algorithm for detecting the next-best interaction for a benchmark GLM. Each subsection within this section is devoted to a specific part of our algorithm. Section 4 contains case studies, where we apply the proposed algorithm to two data sets—an artificially created one and an open-source one—and briefly comment on its usage for big confidential data sets. We offer our conclusions in Sect. 5. Appendix A contains \(\texttt {R}\)-code for the neural interaction detection algorithm.

2 Generalized linear models for modeling insurance claim frequencies

In this section, we briefly describe the basics of a GLM for modeling claim counts. We start with a definition of a GLM without interacting variables and then explain how an interaction is added to a GLM. This section is mainly based on [21]. Since their introduction in 1972, GLMs have enjoyed great popularity for modeling and forecasting claim frequencies within the insurance sector.

Let a data set be denoted by \(\{(N_i, \pmb {x}_i, v_i)\}_{i=1}^n\), where \(n\in \mathbb {N}\) is the number of observations, \(v_i \in [0,1]\) corresponds to the exposure time in years of the i-th observation (the time length in which events occur), \(N_i\in \mathbb {N} \cup \{0\}\) refers to the number of claims observed for the observation i within exposure time \(v_i\), \(\pmb {x}_i = (1,x_{i,1},x_{i,2},\dots , x_{i,p})^\top \in \mathcal {X} \subset \{1\} \times \mathbb {R}^{p}\) represents the vector (see Footnote 1) of variables (features, covariates) for the observation i excluding the exposure time, and \(p\in \mathbb {N}\) is the number of features. In the Poisson GLM context, \(N_i, i = 1,\dots , n,\) are assumed to be independent random variables that follow the Poisson distribution. The mean of the distribution of \(N_i\) is assumed to depend on the so-called systematic (linear) component \(\eta (\pmb {x}_i):= \langle \pmb {\beta }, \pmb {x}_i \rangle = \pmb {\beta }^\top \pmb {x}_i\) and the exposure time \(v_i\) as follows:

$$\begin{aligned}N_i \sim \text {Poisson}(v_i \cdot \exp (\eta (\pmb {x}_i))),\end{aligned}$$

where \(\pmb {\beta } \in \mathbb {R}^{p + 1}\) is the vector of GLM parameters.

Denote \(\lambda ^{GLM}(\pmb {\beta }, \pmb {x}_i):= \exp (\eta (\pmb {x}_i))\). The vector of parameters \(\pmb {\beta }\) is estimated via the maximum likelihood estimation method. We denote the estimated parameters by \(\hat{\pmb {\beta }}\) and the estimated (expected) annual number of claims for an observation \(\pmb {x}_i\) by \(\hat{\lambda }^{GLM}_i:= \lambda ^{GLM}(\hat{\pmb {\beta }}, \pmb {x}_i)\). The weighted average predicted frequency (WAPF) and the weighted average observed frequency (WAOF) are then defined as

$$\begin{aligned} \text {WAPF} = \frac{\sum _{i} \hat{\lambda }^{GLM}_i v_i}{\sum _i v_i}, \text { }\text { } \text {WAOF} = \frac{\sum _{i} N_i}{\sum _i v_i}. \end{aligned}$$
(1)

An important requirement for modeling claim counts is the equality of the WAPF and the WAOF. This property is called a balance property. A GLM satisfies this property on the data set used for model fitting. For the proof of this fact, an interested reader is referred to Equation (2.10) in [20].
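
To make the balance property concrete, the following minimal \(\texttt {R}\) sketch fits a Poisson GLM with a log link and an exposure offset and compares the WAPF with the WAOF on the training data; the column names are illustrative.

```r
# Minimal sketch: Poisson GLM with log link and exposure offset; the column names
# N (claim count), v (exposure), x1, x2 in 'train' are illustrative
fit <- glm(N ~ x1 + x2 + offset(log(v)), family = poisson(link = "log"), data = train)

mu_hat <- predict(fit, type = "response")   # fitted expected counts v_i * lambda_hat_i
wapf   <- sum(mu_hat)  / sum(train$v)       # weighted average predicted frequency, Eq. (1)
waof   <- sum(train$N) / sum(train$v)       # weighted average observed frequency, Eq. (1)
all.equal(wapf, waof)                       # TRUE up to numerical tolerance (balance property)
```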

In GLMs, different choices of variables lead to different predictive performance. The optimal set of variables can be found by fitting GLMs with different subsets of variables and comparing goodness-of-fit measures like the Akaike information criterion (AIC), the Bayesian information criterion (BIC), etc. These model evaluation criteria can be used for the automated selection of the best subset of available variables for a GLM. Popular automated feature selection methods are stepwise backward, forward, and mixed variable selection methods. As mentioned in the introduction section, these methods have huge computational costs for data sets with a large number of variables.

Next, we explain what an interaction means in the context of a GLM. For simplicity, we focus on pairwise interactions, i.e., those between pairs of variables. However, the concepts below can be extended to higher-order interactions, i.e., those among more than two variables. In a GLM, a pairwise interaction is an additional term \(I(\cdot ,\cdot )\) in the GLM’s systematic component \(\eta (\pmb {x})\), and this term is a function that cannot be expressed as a sum of two separate functions of the corresponding single variables. For example, let \(x_{1}\) and \(x_{2}\) be two numerical variables and \(\beta _{1,2}\) be some parameter to be estimated. Then adding the term \(\beta _{1,2} \cdot x_{\cdot ,1} \cdot x_{\cdot ,2}\) to \(\eta (\pmb {x}_{\cdot })\) is considered adding a pairwise interaction, where \(x_{\cdot ,1}\) denotes a generic observation of feature \(x_1\) and \(x_{\cdot ,2}\) denotes a generic observation of feature \(x_2\). However, including the term \(\beta _{1,2}\cdot \ln \left( x_{\cdot ,1} \cdot x_{\cdot ,2} \right) = \beta _{1,2}\cdot \ln \left( x_{\cdot ,1} \right) + \beta _{1,2} \cdot \ln \left( x_{\cdot ,2} \right) \) does not mean adding an interaction, but instead adding two transformed variables.

The parametric form of an interaction term \(I(\cdot ,\cdot )\) depends on the types of the interacting covariates. In the following, we provide three common parametric forms of \(I(\cdot ,\cdot )\) depending on the type of variables; a short \(\texttt {R}\) formula illustration is given after the list.

  • Interaction between two numerical covariates: Let \(x_{1}\) and \(x_{2}\) be two numerical variables. For an observation i, the term modeling the interaction (see Footnote 2) between them can be given by

    $$\begin{aligned}I(x_{i,1},x_{i,2}) = \beta _{1,2} \cdot x_{i,1} \cdot x_{i,2},\end{aligned}$$

    where \(\beta _{1,2}\) is a parameter to be estimated.

  • Interaction between one numerical and one categorical covariate: Let \(x_{1}\) be a numerical variable and \(x_{2}\) be a categorical variable with J categories, where the last one serves as a reference category (also called a base level). For an observation i, the interaction between them can be modeled by

    $$\begin{aligned}I(x_{i,1},x_{i,2}) = \sum _{j=1}^{J-1} \beta _j \cdot x_{i,1} \cdot \mathbb {1}_{\{x_{i,2}=j\}},\end{aligned}$$

    where \(\beta _{j}\) are parameters to be estimated and \(\mathbb {1}_{\{x_{i,2}=j\}}\) equals 1 if observation i of variable \(x_{2}\) falls in category j and 0 otherwise.

  • Interaction between two categorical covariates: Let \(x_{1}\) and \(x_{2}\) be two categorical variables with R and S categories respectively, where the last category of each variable serves as the reference category. For an observation i, the interaction between features \(x_{1}\) and \(x_{2}\) is modeled by

    $$\begin{aligned}I(x_{i,1},x_{i,2}) = \sum _{r=1}^{R-1} \sum _{s=1}^{S-1} \beta _{r,s} \cdot \mathbb {1}_{\{x_{i,1}=r\}} \cdot \mathbb {1}_{\{x_{i,2}=s\}},\end{aligned}$$

    where \(\beta _{r,s}\) are parameters to be estimated, \(\mathbb {1}_{\{x_{i,1}=r\}}\) equals 1 if observation i of variable \(x_{1}\) falls in category r and 0 otherwise, and likewise \(\mathbb {1}_{\{x_{i,2}=s\}}\) equals 1 if observation i of variable \(x_{2}\) falls in category s and 0 otherwise.
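
As a brief illustration of the three forms above, the following \(\texttt {R}\) sketch adds each interaction term on top of the corresponding main effects; the column names num1, num2 (numerical), cat1, cat2 (categorical factors), and the data frame dat are illustrative.

```r
# Illustrative only: ':' adds the pure interaction term I(.,.), '*' would also add the main effects
glm(N ~ num1 + num2 + num1:num2 + offset(log(v)), family = poisson, data = dat)  # numerical x numerical
glm(N ~ num1 + cat1 + num1:cat1 + offset(log(v)), family = poisson, data = dat)  # numerical x categorical
glm(N ~ cat1 + cat2 + cat1:cat2 + offset(log(v)), family = poisson, data = dat)  # categorical x categorical
```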

The search for important interactions is more challenging than the search for the best subset of variables for a GLM, since the number of all possible combinations of interacting variables is usually larger than the number of variables (see Footnote 3). Therefore, actuaries often use their expert knowledge to decrease the number of pairwise interactions to analyze in detail. The interactions to be analyzed are mainly explored in a visual manner, e.g., by evaluating plots that indicate the (weighted) average of the response variable for each unique combination of values of variables (or their binned versions).

In the next section, we describe in detail our suggested approach to detecting important pairwise interactions. It is faster than the majority of methods proposed in the literature and, thus, may save actuaries time to focus on other challenging tasks.

3 Algorithmic detection of the strongest interaction missing in a GLM

From now on we refer to the GLM that is to be improved as the benchmark GLM. To detect the next-best interaction for the benchmark GLM, we suggest an algorithm that consists of three steps:

  1. 1.

    Outperform the benchmark GLM using a CANN model.

  2. 2.

    Rank the interactions learned by the CANN according to their strength.

  3. 3.

    Determine the next-best interaction among top-ranked ones using mini-GLMs.

We refer to the model developed in Step 1 as the competitor model. Note that CANNs and other ML models cannot yet replace the benchmark GLMs used by insurance companies in production, for various reasons, e.g., their lack of interpretability.

3.1 Outperforming the benchmark GLM via CANN

As previously mentioned, GLMs have been a traditional technique for modeling and forecasting claim frequencies within the insurance sector. However, according to [20], their performance is limited in comparison to models based on NNs, which by design learn non-linear interactions among variables. Thus, in Step 1 of the suggested approach, we train a CANN, which can be seen as a boosting step for the benchmark GLM: it is, essentially, a NN that uses the predictions of the benchmark GLM while learning additional interactions among variables to improve the predictive power of the overall model. Before explaining CANNs in more detail, we provide the basics of NNs.

Consider a fully-connected feed-forward NN with \(d \in \mathbb {N}\) hidden layers and one neuron in the output layer. Denote by \(q_l\in \mathbb {N}\) the number of neurons in the l-th hidden layer, \(l = 1, \dots , d\). \(q_0\in \mathbb {N}\) denotes the number of neurons in the input layer. In our application of NNs, \(q_0\) will be equal to the number of neurons needed to encode p variables in the original data set and may be larger than p since each categorical variable usually needs more than one neuron in the input layer. Denote by \(\tilde{\pmb {x}} \in \mathbb {R}^{q_0}\) the vector of pre-processed features that serve as input to the NN. Denote by \(W^{(l)} \in \mathbb {R}^{q_l \times q_{l-1}}\) the weight matrices and by \(\pmb {b}^{(l)} \in \mathbb {R}^{q_l}\) the bias vectors, \(l = 1, \dots , d\). Denote by \(\pmb {w}^{y} \in \mathbb {R}^{q_d}\) and by \(b^{y} \in \mathbb {R}\) the coefficients vector and bias for the output neuron. Denote by \(\phi _l(\cdot )\) the activation function of neurons in the l-th layer \(l = 1, \dots , d+1\) and \(\overrightarrow{\phi _l} (\pmb {\xi }) = (\phi _l(\xi _1), \dots , \phi _l(\xi _{q_l}))^\top \) for any \(\pmb {\xi } \in \mathbb {R}^{q_l}\). Then the hidden layers \(\pmb {z}^{(l)}\) and the output layer consisting of one neuron y (the NN’s prediction) can be expressed as follows:

$$\begin{aligned} y = \phi _{d+1}\left( (\pmb {w}^y)^\top \pmb {z}^{(d)} + b^y\right) , \quad \pmb {z}^{(l)} = \overrightarrow{\phi _l}\left( W^{(l)}\pmb {z}^{(l-1)} + \pmb {b}^{(l)} \right) , \quad l = 1,\dots , d, \end{aligned}$$

with \(\pmb {z}^{(0)}:= \tilde{\pmb {x}}\). Let \(\phi _{d+1}(z) = z\) and denote the regression function of a NN by

$$\begin{aligned} \lambda ^{\text {NN}}(\tilde{\pmb {x}}):= \left( w^y \right) ^\top \left( \pmb {z}^{(d)} \circ \pmb {z}^{(d - 1)} \circ \dots \circ \pmb {z}^{(1)} \right) (\tilde{\pmb {x}}) + b^y. \end{aligned}$$
(2)

A combined actuarial neural network (CANN) for claim counts satisfies two model assumptions:

  1. 1.

    \(N_i \sim \text {Poisson}(v_i \cdot \lambda ^{\text {CANN}}(\tilde{\pmb {x}}_i))\) with the regression function \(\lambda ^{\text {CANN}}\) given by

    $$\begin{aligned} \tilde{\pmb {x}}_i \mapsto \ln \left( \lambda ^{\text {CANN}}(\tilde{\pmb {x}}_i) \right) = \ln (\hat{\lambda }^{\text {GLM}}_i) + \lambda ^{\text {NN}}\left( \tilde{\pmb {x}}_i \right) \end{aligned}$$
    (3)
  2. 2.

    The regression function in (3) is initialized with weights \(w^y = (0, 0, \dots , 0)^\top \in \mathbb {R}^{q_{d}}\) and \(b^y = 0\).

The first structural assumption means that the NN part of CANN boosts the benchmark GLM. The second structural assumption implies that at the beginning of the training phase, the Poisson deviance of a CANN model equals the Poisson deviance of the benchmark GLM. If the Poisson deviance loss is used as the objective function for training a CANN, then during training the gradient-descent algorithm explores the NN architecture for an additional model structure that is not present in the benchmark GLM and that further decreases the CANN’s Poisson deviance.

It is not necessary to know the structure of the benchmark GLM; only its predictions are used by a CANN model. Since the GLM part is kept fixed (non-trainable), the implementation of a Poisson CANN can be simplified by merging the annualized predictions of the benchmark GLM with the given exposures. In particular, as an alternative to the first structural assumption of a Poisson CANN, we can consider \(N_i \sim \text {Poisson}(v^{\text {GLM}}_i \cdot \lambda ^{\text {NN}}(\tilde{\pmb {x}}_i))\) with modified exposure \(v_i^{\text {GLM}}:= v_i \cdot \hat{\lambda }^{\text {GLM}}_i\).

The architecture of a CANN is illustrated in Fig. 1. The neuron marked blue takes as input the modified exposure \(v_i^{\text {GLM}}\) and passes it directly to the output neuron marked green, where the corresponding red-marked connection has a non-trainable weight 1 and a bias coefficient 0. The neurons marked black and the connections among them constitute the NN component of the CANN. In the NN component of the CANN model shown in Fig. 1, \(q_0 = 8, q_1 = 6, q_2 = 4, q_3 = 1\). The output \(\lambda ^{\text {NN}}\) of the NN component is passed via the red-marked connection (with a non-trainable weight 1) to the green-marked output neuron. The output neuron sums the two incoming values and applies the exponential activation function to the result. The CANN model is trained using the Poisson deviance loss function.

Fig. 1: Architecture of a CANN model
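
To make the construction concrete, the following minimal \(\texttt {keras}\) sketch in \(\texttt {R}\) mirrors the structure of Fig. 1; it assumes 8 already pre-processed numerical input neurons and the logarithm of the modified exposure \(v_i^{\text {GLM}}\) as a second input, and it omits embedding layers (discussed below). All object names, layer sizes, and activation functions are illustrative.

```r
library(keras)

# Two inputs: pre-processed features and log(v_i * lambda_hat_GLM_i) for the skip connection
features  <- layer_input(shape = c(8), name = "features")
log_v_glm <- layer_input(shape = c(1), name = "log_v_glm")

# NN component (q1 = 6, q2 = 4, q3 = 1 as in Fig. 1); the last layer starts at zero so that
# the initial CANN prediction equals the benchmark-GLM prediction
nn <- features %>%
  layer_dense(units = 6, activation = "tanh") %>%
  layer_dense(units = 4, activation = "tanh") %>%
  layer_dense(units = 1, activation = "linear",
              kernel_initializer = "zeros", bias_initializer = "zeros")

# Skip connection with non-trainable weight 1 and bias 0, followed by the exponential activation
output <- layer_add(list(nn, log_v_glm)) %>%
  layer_dense(units = 1, activation = "exponential", trainable = FALSE,
              kernel_initializer = initializer_constant(1),
              bias_initializer   = initializer_constant(0))

cann <- keras_model(inputs = list(features, log_v_glm), outputs = output)
cann %>% compile(loss = "poisson", optimizer = optimizer_rmsprop())
```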

As mentioned above, the original vector of variables \(\pmb {x} \in \mathbb {R}^{p}\) does not enter the input layer of a NN; its pre-processed version \(\tilde{\pmb {x}}\in \mathbb {R}^{q_0}\) does. In particular, all features that appear in the input layer of a NN must not contain missing values, should be numerical, and should ideally have the same range. Therefore, we apply min-max scaling to all numerical features used for training a NN. As for categorical features, we recommend one-hot encoding for features with a low number of unique categories, e.g., below 5, and embedding layers for those with a larger number of unique categories. These techniques are recommended in [21].

An embedding e of a categorical feature with k distinct categories \(\{a_1,...,a_k\}\) is a mapping

$$\begin{aligned} e:\{a_1,\dots ,a_k\}\rightarrow \mathbb {R}^{g}, \text { } a\mapsto e(a), \end{aligned}$$

with \(g\in \mathbb {N}\) denoting the dimension of the embedding. This dimensionality parameter is chosen by the user, where typically \(g \ll k\). The components

$$\begin{aligned}e(a_1)_1,\dots ,e(a_1)_g,\dots , e(a_k)_1,\dots ,e(a_k)_g\end{aligned}$$

of such an embedding of k categories constitute additional NN weights that are learned during training. So an embedding layer of dimension g results in \(g\cdot k\) additional embedding weights. The embedding representation of an embedded feature, i.e., the output of the embedding layer for an observation with category a, equals the learned embedding vector \(e(a)\). A NN with the above-mentioned peculiarities is schematically illustrated in Fig. 2.

Fig. 2: Example of the NN part of a CANN model that uses a 2-dimensional embedding layer (in light blue) encoding a categorical feature \(\tilde{x}_{\cdot , 7}\)
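
A minimal \(\texttt {keras}\) sketch of such an embedding input, assuming a categorical feature with \(k = 7\) categories (coded as integers \(0,\dots ,6\)) and embedding dimension \(g = 2\); the names are illustrative.

```r
k <- 7; g <- 2                                          # number of categories and embedding dimension
cat_in  <- layer_input(shape = c(1), name = "x_cat")
cat_emb <- cat_in %>%
  layer_embedding(input_dim = k, output_dim = g) %>%    # k * g = 14 trainable embedding weights
  layer_flatten()
# cat_emb is then concatenated with the scaled numerical inputs, e.g.,
# layer_concatenate(list(features, cat_emb)), before the first hidden layer
```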

To reduce the risk of overfitting, we recommend using the dropout technique and early stopping of the NN training process. According to the dropout technique proposed by [17], a pre-specified percentage (the so-called dropout rate) of randomly selected neurons in each layer is “switched off” and not updated in a NN training step. According to the early-stopping method, the NN training is stopped as soon as a significant deterioration or no significant improvement in the model performance is observed within a predefined number of epochs. A Poisson CANN model does not satisfy the balance property (1). As we will see in the numerical studies, and as is also found in the numerical studies of [16], the violation of the balance property is very small, since a CANN model uses the predictions of a GLM that fulfills this property. In view of interaction detection being the main focus of our paper, the violation of the balance property is negligible. Readers interested in enforcing the balance property on neural networks are referred to [20] and [21].

Before training the above-described CANN, one has to specify certain hyper-parameters, e.g., the embedding dimension, the number of hidden layers, the number of neurons per layer, the dropout rates, the activation functions, the loss function, the optimizer, the batch size, the number of epochs, and the usage of early stopping. Our experiments on both artificial and real data sets show that choosing 3 hidden layers, an embedding dimension of 2, the Poisson deviance as the loss function, and a dropout rate of \(10\%\) gives a CANN a very high chance of outperforming sophisticated benchmark GLMs. If one would like to further improve the performance of the ML model, then one should explore ML models with different values of hyper-parameters. The search for the optimal values of hyper-parameters can be done either via a grid search or a genetic algorithm. The latter approach is more time-consuming, but it can yield a better model performance.

To compare the performance of a ML model (for short, competitor) and a benchmark GLM (for short, benchmark) in the context of MTPL insurance claim counts, we recommend using so-called double lift plots on the test data set. These plots are of high practical importance for actuaries, see, e.g., Section 7.2.2 in [5]. A double lift plot requires the predictions of each of the two models and the true observed values of the response variable. A double lift plot is created in the following way:

  1. 1.

    Determine the relative deviation \(\delta _i\) (also called the sort ratio) between the competitor model and the benchmark GLM:

    $$\begin{aligned}\delta _i = \frac{\hat{\lambda }^{\text {competitor}}_i}{\hat{\lambda }^{\text {benchmark}}_i} - 1,\end{aligned}$$

    where \(\hat{\lambda }^{competitor}_i\) denotes the i-th prediction of the competitor model and \(\hat{\lambda }_i^{\text {benchmark}}\) refers to the i-th prediction of the benchmark GLM.

  2. 2.

    Sort the observations based on \(\delta _i\), from smallest to largest.

  3. 3.

    Bucket the observations into predetermined bins in an interval of interest, e.g., bins \((-\infty , -0.5]\), \((-0.5, -0.48]\), \((-0.48, -0.46], \dots , (0.48, 0.5]\), \((0.5, +\infty )\).

  4. 4.

    For each bin, calculate the exposure, WAOF, WAPF of the competitor model, and WAPF of the benchmark model.

  5. 5.

    For each bin, plot the quantities calculated in Step 4. The left y-axis refers to the WAOF or WAPF that are marked by dots in the double lift plot. The right y-axis refers to the exposure that is depicted by bars below the dots.

An example of a lift plot with predetermined binning can be seen in the left sub-figure of Fig. 3. As an alternative to bucketing the observations based on the predetermined binning, one can use quantile-based binning. A double-lift plot of this type can be seen in the right sub-figure of Fig. 3. In this chart, each bin has the same number of observations and is determined based on the quantiles of the distribution of \(\delta _i\).

Fig. 3: Lift plot 1 has predetermined bins (PB). Lift plot 2 has quantile-based bins (QBB)

Obviously, the evaluation of these lift plots is based on visual perception. In order to allow for a purely quantitative model evaluation, we construct KPIs that reflect the information captured in these lift plots without requiring their visual inspection.

Let \(B=\{1,...,|B|\}\) be the set of bins in a lift plot and \(b \in B\) an index of a certain bin. Define the weighted exposure \(u_b\) per bin as follows:

$$\begin{aligned} u_{b}=\frac{\sum _{x_{i} \in \mathcal {X}_{b}} v_{i}}{\sum _{x_{i} \in \mathcal {X}} v_{i}}, \end{aligned}$$

where \(\mathcal {X}_b\) is the set of feature vectors that correspond to observations in bin \(b \in B\). The numerator equals the total exposure of the observations in bin b and the denominator equals the total exposure in the whole data set used for calculating this KPI. Using this weighted exposure per bin, the mean absolute error based on the lift plot bins is given by

$$\begin{aligned}\mathrm{mae\_lift\_...}= \sum _{b \in B} u_{b}|\text {WAPF}_{b}-\text {WAOF}_{b}|.\end{aligned}$$

All in all, we thereby construct a selection of numerical KPIs, namely

  • \(\mathrm{mae\_lift\_pb}\) and \(\mathrm{mae\_lift\_pb\_benchmark}\), which are based on the lift plot with predetermined bins and use the predictions of the competitor model and the benchmark model respectively

  • \(\mathrm{mae\_lift\_qbb}\) and \(\mathrm{mae\_lift\_qbb\_benchmark}\), which are based on the lift plot with quantile-based bins and use the predictions of the competitor model and the benchmark model respectively

The smaller the value of \(\mathrm{mae\_lift\_\dots }\), the better the model. For example, if \(\mathrm{mae\_lift\_pb}\) is smaller than \(\mathrm{mae\_lift\_pb\_benchmark}\), the competitor model outperforms the benchmark model based on the lift-plot with predetermined bins. The same reasoning holds for the KPIs using quantile-based binning.
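The following \(\texttt {R}\) sketch computes the quantile-based version of these KPIs from the predictions and observations on the test set; the vector names are illustrative, and duplicate quantile breaks are removed for robustness.

```r
# mae_lift_qbb and mae_lift_qbb_benchmark from annual predictions pred_comp, pred_bench,
# observed claim counts N, and exposures v (illustrative names)
double_lift_mae <- function(pred_comp, pred_bench, N, v, n_bins = 20) {
  delta  <- pred_comp / pred_bench - 1                              # sort ratio per observation
  breaks <- unique(quantile(delta, probs = seq(0, 1, length.out = n_bins + 1)))
  bins   <- cut(delta, breaks = breaks, include.lowest = TRUE)      # quantile-based bins
  v_b    <- tapply(v, bins, sum)
  u      <- v_b / sum(v)                                            # weighted exposure per bin
  waof   <- tapply(N, bins, sum) / v_b                              # observed frequency per bin
  wapf_c <- tapply(pred_comp  * v, bins, sum) / v_b                 # competitor WAPF per bin
  wapf_b <- tapply(pred_bench * v, bins, sum) / v_b                 # benchmark WAPF per bin
  c(mae_lift_qbb           = sum(u * abs(wapf_c - waof)),
    mae_lift_qbb_benchmark = sum(u * abs(wapf_b - waof)))
}
```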

In summary, actuaries often rely on lift plots when evaluating model performance. However, the visual interpretation of such lift plots may be rather subjective. Hence, transforming the lift plot into a numeric KPI and using it alongside the Poisson deviance for model selection may enhance the reliability and objectivity of the performance evaluation. For example, it may happen (see Footnote 4) that the Poisson deviance is the same for two models, but one model is convincingly better than the other according to the lift-plot-based KPIs.

3.2 Opening the black box: ranking learned interactions

Having found a well-performing CANN, the next step is to find the most significant pairs of interacting variables learned by the model. Here, “significant” means that those pairs of interacting variables that are captured by the CANN model are likely to strongly improve the predictive power of the benchmark GLM if included in it. In the CANN model, its NN component learns non-linear interactions among the input features. To quantify the significance of each of the learned interactions, we apply a fast model-specific interaction-detection method. The method can be seen as an adjustment of a technique called Neural Interaction Detection (NID), proposed by [19] for fully-connected feed-forward NNs.

The original NID algorithm is based on the assumption that feature interactions are created in the first hidden layer of a neural network. Note that learning interactions in the first hidden layer is possible due to the usage of non-linear activation functions. Moreover, [19] provides empirical evidence that considering the first hidden layer is indeed sufficient for determining interactions. These interactions are then propagated through the whole network and influence the final prediction. This concept is exemplified by Fig. 4. As can be seen, the first neuron in the first hidden layer \(\pmb {z}^{(1)}\) (highlighted in blue) takes inputs \(\tilde{x}_1\) and \(\tilde{x}_3\) and thereby creates an interaction between them if the activation function of that neuron is non-linear (see Footnote 5). The strength of this interaction is evaluated based on both the incoming weights and the outgoing paths from the neuron to the output neuron y, as colored in blue in Fig. 4. The higher the incoming weights and the higher the impact of the considered neuron on the final output, the stronger the interaction. The strength of an interaction is quantified by an interaction strength score.

Fig. 4: Generation of interactions in the first hidden layer and propagation of these interactions through the neural network (figure adapted from [19])

Let \(\mathcal {I}\) be a pair of input neurons. The interaction between these input neurons happens at each neuron of the first hidden layer. Denote by \(s_j(\mathcal {I})\) the strength of an interaction between input neurons in \(\mathcal {I}\) measured at the j-th neuron in the first hidden layer, \(j=1,\dots , q_1\). It is quantified as follows:

$$\begin{aligned} s_j(\mathcal {I})=\zeta _j^{(1)}\cdot \mu (|W_{j,\mathcal {I}}^{(1)}|), \text { } s_j(\mathcal {I}) \in \mathbb {R}, \end{aligned}$$
(4)

where \(\zeta _j^{(1)}\) represents the influence of neuron j in the first hidden layer on the model prediction, \(|W_{j,\mathcal {I}}^{(1)}|\) denotes the absolute values of the incoming weights from the input neurons in \(\mathcal {I}\) to neuron j in the first hidden layer, and \(\mu (\cdot )\) represents a so-called generalized surrogate function used to capture the strength of the interaction based on the relevant incoming weights. In our notation, \(|\cdot |\) applied to a matrix means that the absolute value is taken element-wise, i.e., for all matrix elements.

As per [19], the generalized surrogate function \(\mu (\cdot )\) should be such that interaction strength is

  1. 1.

    Quantified as zero when the interaction does not exist;

  2. 2.

    Non-decreasing in the magnitude of feature weights;

  3. 3.

    Less sensitive to changes in large feature weights.

The third property mitigates the impact of situations in which the weight of the connection from one input neuron has a much higher magnitude than the weight of the connection from another input neuron. If the large weight grows in magnitude, then the interaction strength should not change much, but if instead the smaller (in magnitude) weight grows at the same rate, then the interaction strength should increase. Thus, the maximum, the root mean square, and the arithmetic mean are not suitable candidates for \(\mu (\cdot )\). [19] empirically investigates a selection of possible surrogate functions and concludes that the minimum is the best-performing function, recovering the highest number of true interactions in their experiments. The second-best choice for \(\mu (\cdot )\) was the harmonic mean. Therefore, we choose the minimum as \(\mu (\cdot )\) in all our experiments.

The influence \(\pmb {\zeta }^{(1)}\) of the first-hidden-layer neurons on the network prediction is calculated as the following matrix product of the absolute weight matrices:

$$\begin{aligned} \pmb {\zeta }^{(1)}=|\pmb {w}^{y}|^{\top }\cdot |W^{(d)}|\cdot |W^{(d-1)}|\cdot ... \cdot |W^{(2)}|, \text { } \pmb {\zeta }^{(1)} \in \mathbb {R}^{q_1}. \end{aligned}$$
(5)

In this case, \(q_1\) denotes the number of neurons in the first hidden layer, \(W^{(m)}\) each represents the weight matrix connecting the neurons in hidden layers \(m-1\) and m, whereas \(\pmb {w}^{y}\) denotes the vector of weights connecting the last hidden layer and the output neuron. Note that \(\pmb {\zeta }^{(1)}\) results in a vector where the j-th index corresponds to the influence of a neuron j of the 1-st hidden layer on the output neuron of a NN. According to Lemma 3 in [19], if all activation functions in a NN are 1-Lipschitz continuous, then expression (5) is an upper bound for the gradient magnitudes of neurons in the first hidden layer, i.e., if \(\left| \frac{\partial \phi (x) }{ \partial x} \right| \le 1\), then \(\left| \frac{\partial y }{ \partial z_j^{(1)}} \right| \le \zeta _j^{(1)}\) for all \(j = 1, \dots , q_1\). Common activation functions such as rectified linear unit, hyperbolic tangent and sigmoid are 1-Lipschitz continuous.

After having extracted the incoming weights of the NN as well as the importance (w.r.t. the influence on the NN’s output) of each neuron in the first hidden layer, the strength of a (local) interaction between a subset of input neurons can be computed for each neuron of the first hidden layer. Subsequently, the final interaction strength score for this subset of input neurons is equal to the sum of local interaction strength scores across all \(q_1\) neurons in the first hidden layer:

$$\begin{aligned}s(\mathcal {I}) = \sum _{j=1}^{q_1} s_j(\mathcal {I}),\end{aligned}$$

which is illustrated in Fig. 5.

Fig. 5: Illustration of the interaction-strength calculation
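
The following compact \(\texttt {R}\) sketch maps formulas (4) and (5) to code for pairwise interactions; the complete implementation we actually use is given in Appendix A, and the function name here is illustrative.

```r
# W1: q1 x q0 weight matrix of the first hidden layer; Ws: list of deeper weight matrices
# W^(2), ..., W^(d); wy: weight vector of the output neuron; mu: surrogate function
nid_pairwise <- function(W1, Ws, wy, mu = min) {
  q0 <- ncol(W1)
  # influence of each first-hidden-layer neuron on the output, Eq. (5)
  zeta <- abs(wy)
  for (W in rev(Ws)) zeta <- as.vector(t(abs(W)) %*% zeta)
  scores <- matrix(0, q0, q0)
  for (a in 1:(q0 - 1)) {
    for (b in (a + 1):q0) {
      s_j <- zeta * apply(abs(W1[, c(a, b), drop = FALSE]), 1, mu)  # Eq. (4) per hidden neuron
      scores[a, b] <- sum(s_j)                                      # sum over the first hidden layer
    }
  }
  scores   # upper-triangular matrix of pairwise interaction-strength scores
}
```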

Recall that a CANN model is a combination of a NN and a benchmark-GLM prediction with a skip connection to the output neuron of the model, where the NN component is a feed-forward fully-connected neural network. Therefore, we can apply NID to the NN component of the CANN model to quantify the strength of all pairwise (see Footnote 6) interactions among features. We provide the \(\texttt {R}\)-code for the described NID approach in Listing 5 in Appendix A.

The original NID method evaluates the strengths of interactions among input neurons. However, the encoding of categorical features may require several neurons in the input layer of a NN; e.g., in the case of one-hot encoding one needs as many input neurons as there are categories. If an actuary wants to detect interactions on a per-category level, then it is not necessary to aggregate NID scores related to the neurons encoding a categorical variable. However, to obtain the interaction-strength scores related to a categorical feature taken as a whole, one has to aggregate the per-neuron scores using some aggregation function such as the mean, minimum, or maximum. If an actuary is interested in finding interactions where the majority of categories of the categorical feature of interest interact strongly with another variable, then the minimum is recommended. If the aim is to find categorical variables whose categories have on average high interaction-strength scores with the other variable, then the mean is a good choice for aggregation. Finally, if an actuary is interested in finding categorical variables where one category interacts especially strongly with the other variable, then we recommend using the maximum as the aggregation function. For example, to get the strength of an interaction between a categorical one-hot encoded feature and a numerical feature, one can take the maximum of all \(s(\mathcal {I})\) where \(\mathcal {I}\) contains an input neuron encoding a category of the categorical feature of interest and the input neuron encoding the numerical feature of interest.
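
As a small illustration of this per-feature aggregation, the following sketch re-uses the upper-triangular score matrix from the \(\texttt {nid\_pairwise()}\) sketch above; the index vectors are illustrative.

```r
# cat_idx: input-neuron indices encoding the categories of a one-hot encoded feature;
# num_idx: input-neuron index of the numerical feature; agg: aggregation function (min, mean, max)
aggregate_cat_score <- function(scores, cat_idx, num_idx, agg = max) {
  pair_scores <- sapply(cat_idx, function(a) scores[min(a, num_idx), max(a, num_idx)])
  agg(pair_scores)
}
```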

3.3 Identification of the next-best interaction for a GLM

After extracting the most significant interactions, the final step is to determine the next-best interaction for the benchmark GLM. This step is necessary for several reasons. First, the inclusion of any interaction in a GLM requires a parametric specification \(I(\cdot , \cdot )\) of the interaction. This is also important for preserving the interpretability of the benchmark GLM. Second, it may happen that several top-ranked interactions have very similar interaction-strength scores according to NID, which is why choosing the next-best interaction may become ambiguous. In this case, an actuary may want to estimate the improvement of the benchmark GLM for each of the top-ranked interactions and afterwards decide which one to include and retrain the benchmark model with the found interaction.

To decide which interaction to add to the benchmark GLM, we suggest predicting the observed claim counts via “mini-GLMs” that use the predictions of the benchmark GLM and the top-ranked interactions. This approach can also be interpreted as freezing the coefficients of the benchmark GLM and adding one interaction on top to better predict the claim counts. The approach works as follows:

  1. 1.

    For each \((x_{j}, x_{k})\) from the list of top-ranked interactions and for each relevant parametric form of \(I(\cdot , \cdot )\)

    1. (a)

      Fit a mini-GLM assuming:

      $$\begin{aligned} N_{i} \sim \text {Poisson}\left( v_{i}\hat{\lambda }^\text {benchmark}_{i} \cdot \exp \left( I(x_{i,j}, x_{i,k}) \right) \right) , \,\, i = 1,2, \dots , n. \end{aligned}$$
    2. (b)

      Calculate the KPIs of interest, e.g., AIC, residual deviance, etc.

  2. 2.

    Recommend as the next-best interaction the one that corresponds to the mini-GLM with the best KPIs.
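
A minimal \(\texttt {R}\) sketch of Step 1 for one candidate pair of numerical features, assuming the data frame contains the claim counts, exposures, features, and benchmark-GLM predictions under illustrative names; the intercept is dropped so that the linear predictor contains only the interaction term \(I(\cdot ,\cdot )\), as in the display above.

```r
# Mini-GLM for the candidate interaction between numerical features x4 and x5;
# lambda_bench holds the annualized benchmark-GLM predictions (illustrative names)
mini <- glm(N ~ x4:x5 - 1 + offset(log(v * lambda_bench)),
            family = poisson(link = "log"), data = dat)
c(AIC = AIC(mini), residual_deviance = deviance(mini))   # KPIs used to rank the candidates
```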

Remarks

  1. 1.

    The word “relevant” refers to the fact that the exact functional form \(I(x_{\cdot ,j}, x_{\cdot ,k})\) of the interaction between variables \(x_{j}\) and \(x_{k}\) depends on the types of these variables (see Sect. 2). The expressions \(x_{\cdot ,j}\) and \(x_{\cdot ,k}\) denote the respective values (realizations) of variables \(x_j\) and \(x_k\) in a generic observation.

  2. 2.

    If at least one of the interacting variables is continuous, one has multiple options for choosing the parametric form of the interaction:

    1. (a)

      Consider several continuous transformations of the continuous feature(s) of interest. For example, an actuary may consider only parametric interactions of the form of \(I(x_{\cdot ,j}, x_{\cdot ,k}) = \beta _{j,k}\cdot x_{\cdot ,j}^{a} \cdot x_{\cdot ,k}^b\) for \(a \in \{ 1,2,3 \}\) and \(b \in \{ 1,2,3 \}\). The form that leads to a mini-GLM with the best KPI is chosen.

    2. (b)

      Bin the continuous feature(s) of interest and include the interaction between the binned versions of those features. A simple binning procedure can be based on the quantiles of their distribution. A more advanced binning procedure can be based on fitting a generalized additive model (GAM) that uses only a smooth version of the interaction of interest and the predictions of the benchmark-GLM as offset. Afterwards one trains a regression tree that predicts the GAM-captured interaction effect using the interacting features and concludes the optimal binning from the splits of the regression tree. For more information on this method, see Section 4.2 in [7].

  3. 3.

    It may be computationally challenging (see Footnote 7) to fit a mini-GLM for categorical features with a large number of categories, e.g., postcode. For these cases, we recommend clustering the categories of such variables based on their embedding representations.
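
A hedged sketch of this idea: the learned embedding weights of the high-cardinality feature are clustered with k-means, and the resulting cluster labels replace the raw categories in the mini-GLM; the matrix name emb and the number of clusters are illustrative.

```r
# emb: k x g matrix of learned embedding weights (one row per category), extracted from the CANN
set.seed(1)
grp <- kmeans(emb, centers = 10)$cluster      # one cluster label per original category
# the clustered factor (one level per cluster) is then used instead of the raw categorical
# feature, e.g., postcode, inside the interaction term of the mini-GLM
```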

4 Case studies

In this section, we summarize the results of several case studies, which we conduct on a computer with 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz processor, 32 GB RAM, Intel(R) Iris(R) Xe Graphics, and Windows 10 Enterprise operating system. In the first case study, we work with an artificially generated data set, where we know the true interactions among variables in the data set. The aim of this case study is to show that our methodology detects and recommends the true interaction. In the second case study, we work with an open-source data set, where the true interactions in the data are not known. In the third case study, we briefly discuss the benefits of our methodology for big data sets, since big insurers have millions of observations and keep track of tens of variables.

4.1 Artificial data set

In this subsection, we apply the previously described methodology to an artificially created data set. We start with generating vectors of covariates \(\pmb {x}_i = (x_{i,1}, x_{i,2},\dots , x_{i,10})^\top \in \mathbb {R}^{10}\), \(i = 1,\dots , 2 \cdot 10^6\), such that each vector is independent of the other vectors. Variables \(x_1,\dots , x_8\) are numerical and come from a multivariate normal distribution with zero mean and unit variance, as in [15], namely \((x_{1}, \dots , x_{8})^\top \sim N(0,\Sigma )\) with \(\Sigma \) being an identity matrix with an additional entry of 0.5 in the cells (2, 8) and (8, 2). Covariates \(x_9\) and \(x_{10}\) are categorical and come from binomial distributions. To be specific, the variable \(x_9 \sim \text {Binomial}(2, 0.3)\) has three categories \(\{0, 1, 2\}\) and is independent of the other variables. The covariate \(x_{10} \sim \text {Binomial}(5, 0.2)\) has six categories \(\{0, 1, 2, 3, 4, 5\}\) and is independent of the other covariates. For simplicity, we assume that \(v_i = 1,\, i = 1,\dots , 2 \cdot 10^{6}\).
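
A sketch of this covariate simulation in \(\texttt {R}\) (the claim counts are generated afterwards from the intensity \(\mu (\pmb {x})\) described in the following paragraphs):

```r
library(MASS)
set.seed(1)
n <- 2e6
Sigma <- diag(8); Sigma[2, 8] <- Sigma[8, 2] <- 0.5   # identity with 0.5 in cells (2,8) and (8,2)
X   <- mvrnorm(n, mu = rep(0, 8), Sigma = Sigma)      # numerical covariates x1, ..., x8
x9  <- rbinom(n, size = 2, prob = 0.3)                # 3 categories {0, 1, 2}
x10 <- rbinom(n, size = 5, prob = 0.2)                # 6 categories {0, ..., 5}
v   <- rep(1, n)                                      # unit exposure for all observations
```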

Based on the above-generated features and exposure, we generate the claim counts as follows. First, we calculate

For a small number of feature vectors it holds that \(\mu (\pmb {x}) > 1\). In those cases we set \(\mu (\pmb {x}) = 1\) to avoid an unrealistically large number of claim counts for those vectors. In the final step of the data-generation process, we obtain claim counts by generating them as follows:

$$\begin{aligned} N_i \sim \text {Poisson}(v_{i}\cdot (\mu (\pmb {x}_i))), \quad i = 1,\dots , 2 \cdot 10^{6}. \end{aligned}$$

The structure of the resulting data set is summarized in Listing 1.

[Listing 1]

We split the data set as follows: \(80\%\) for training, \(10\%\) for validation, and \(10\%\) for testing. This splitting rule follows [1]. The training set is used for fitting the model, the validation set is utilized for fine-tuning the hyper-parameters of the ML model, and the test set is used for evaluating the final out-of-sample performance of the chosen best-performing ML model. This results in the claim distributions shown in Table 1.

Table 1 Distribution of the number of claims

To fit a benchmark GLM, we use both training and validation data. In this GLM, we include terms \(x_1\), \(x_2^2\), \(x_3\), \(x_3^2\), \(x_9\), \(x_{10}\), which appeared in the data generation process. However, we do not include in the benchmark GLM the interactions between features \(x_4\) and \(x_5\) and between features \(x_5\) and \(x_6\), which are the true interactions according to the process of the artificial data generation. If our interaction-detection methodology works correctly, one of these interactions will be recommended as the next-best one to be included in the benchmark GLM.

4.1.1 Step 1: training CANN

We conduct the following data pre-processing steps prior to training the CANN model:

  • Use one-hot encoding for the categorical feature \(x_9\).

  • Use a 2-dimensional embedding layer for the categorical feature \(x_{10}\).

  • Apply min-max scaling to all numerical features \(x_1, \dots , x_8\):

    $$\begin{aligned} \tilde{x}_{\cdot , j} = \frac{2 \cdot (x_{\cdot , j} - \min (x_{\cdot , j}))}{\max (x_{\cdot , j}) - \min (x_{\cdot , j})} - 1,\quad j = 1,\dots , 8. \end{aligned}$$

To fit a CANN, we use the \(\texttt {R}\) package \(\textit{keras}\). The search for the optimal hyper-parameters of a CANN is based on the KPIs introduced in Subsection 3.1. To find the best CANN model, we search for the best combination of hyper-parameters along a pre-defined grid of hyper-parameters. We focus on the leaky rectified linear unit (LReLU), sigmoid (\(\sigma \)) and hyperbolic tangent (TanH) activation functions that are defined as

$$\begin{aligned} \text {LReLU}({w},\alpha )&= \max ({w}, \alpha \cdot {w}),\text { }\\ \sigma ({w})&= \frac{1}{1+e^{-{w}}}, \text { }\\ \text {TanH}({w})&= 2\sigma (2{w})-1, \end{aligned}$$

with w denoting the weighted sum of the inputs of the neuron to which the activation function is applied and \(\alpha \in (0, 1)\) being a parameter, which we set to 0.3 in all our case studies. We use the Poisson deviance loss function, which is minimized via the RMSProp optimizer. To prevent overfitting, we use a dropout rate of \(5\%\) and early stopping of model training when the value of the loss function does not improve for 5 epochs in a row. We set the dimension of the embedding layers to 2 and the number of neurons in the first, second, and third hidden layers to \(q_1 = 20\), \(q_2 = 15\), \(q_3 = 10\), respectively. In addition, the batch size is set to 1000 and the number of epochs to 100.
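
The corresponding training call in \(\texttt {keras}\) may look as follows; the object names (the model cann, the input lists, and the response vectors) are illustrative.

```r
cann %>% compile(loss = "poisson", optimizer = optimizer_rmsprop())
cann %>% fit(
  x = list(x_train, log_v_glm_train), y = N_train,
  validation_data = list(list(x_val, log_v_glm_val), N_val),
  batch_size = 1000, epochs = 100,
  # stop when the monitored loss does not improve for 5 epochs (monitoring choice is an assumption)
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 5))
)
```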

According to both Poisson deviance and the lift-plot-based KPIs mae_lift_pb and mae_lift_qbb, the best CANN model (among those we tested) for the artificial data has LReLU activation function in all neurons of all hidden layers. This architecture is summarized in Fig. 6.

Fig. 6: The architecture of the NN component of the best-performing CANN model

As can be seen, the input layer is composed of 13 neurons: 8 for the numeric features, 3 for the one-hot encoded feature \(x_9\) with its 3 categories, and 2 for the categorical feature \(x_{10}\), which is encoded via an embedding layer of dimension two. The input layer is connected to the first hidden layer via the weight matrix \(W^{(1)}\). Similarly, the first and the second hidden layers are connected via the weight matrix \(W^{(2)}\), and the second and the third hidden layers are connected via the weight matrix \(W^{(3)}\). The third (last) hidden layer is connected to the output layer via the vector of weights \(\pmb {w}^{y}\).

The KPIs for the best CANN are summarized in Table 2.

Table 2 KPIs on the test data for the best-performing CANN model

4.1.2 Step 2: ranking of learned interactions via neural interaction detection

From the best-performing CANN model, we extract the weight matrices \(W^{(1)}\), \(W^{(2)}\), \(W^{(3)}\) as well as the vector \(\pmb {w}^y\). The weight matrices can be extracted using the \(\textit{get\_weights}\) function in \(\texttt {R}\). The structure of the resulting output is shown in Listing 2. The first element of the list corresponds to the embedding weight matrix \(W^{(e(\tilde{x}_{12}))} \in \mathbb {R}^{7 \times 2}\) depicted in Fig. 6. The second list element represents the transposed version of the weight matrix \(W^{(1)}\) connecting the input layer consisting of 13 neurons (11 numeric & one-hot features + \(1\cdot 2\) neurons of the embedding layer related to \(x_{10}\)) with the first hidden layer, which has 20 neurons. Likewise, the fourth and the sixth elements of the list correspond to the transposed versions of the weight matrices \(W^{(2)}\) and \(W^{(3)}\), respectively. The eighth element of the list is the vector \(\pmb {w}^y\) that connects the last hidden layer of the NN component with its output neuron y. The third, the fifth, the seventh, and the ninth elements of the list represent the bias vectors corresponding to the three hidden layers and the output layer of the NN component of the CANN, respectively. The tenth and the eleventh elements of the list are the non-trainable weights corresponding to the red-marked connections in Fig. 1. The twelfth (last) element of the list is the (non-trainable) bias related to the output neuron of the CANN.

[Listing 2]

Next, we apply the modified NID to calculate the strength of the interaction for each pair of features. Following the recommendation of [19], we use \(\min (\cdot )\) as the surrogate function \(\mu (\cdot )\). Having obtained the strength of the interaction for each pair of input neurons, we apply the aggregation procedure for categorical features, as proposed in Subsection 3.2. In particular, we use the minimum as the aggregation function (see Footnote 8). Finally, we sort the resulting list and provide the top 5 entries in Table 3.
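
A short sketch of this extraction and scoring step, re-using the \(\texttt {nid\_pairwise()}\) sketch from Subsection 3.2; the list indices follow the structure of Listing 2.

```r
w  <- get_weights(cann)                                    # list of trained weights, cf. Listing 2
W1 <- t(w[[2]]); W2 <- t(w[[4]]); W3 <- t(w[[6]])          # transpose back to q_l x q_(l-1) shape
wy <- as.vector(w[[8]])
scores <- nid_pairwise(W1, list(W2, W3), wy, mu = min)     # pairwise interaction-strength scores
```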

Table 3 Top 5 pairwise interactions according to the CANN \(+\) NID approach

As can be seen, our modified NID procedure ranks the interactions between features \(x_4\) and \(x_5\) and between features \(x_5\) and \(x_6\) as the first and the second strongest, respectively. Interestingly, the NID procedure suggests that the third-ranked interaction happens between \(x_4\) and \(x_6\). The likely reason is that \(x_5\) appears in both true interaction terms, \(0.5\cdot x_4 x_5\) and \(0.125\cdot x_5^2 x_6\), so the network links \(x_4\) and \(x_6\) indirectly through \(x_5\). The strength of the interactions among the other variables is quantified as much lower.

Next, we compare our method with another approach used by practitioners, namely training a gradient boosting machine (GBM) and calculating Friedman’s H-statistic for each pair of features. Training one GBM model takes around 120 seconds. Calculating Friedman’s H-statistic is very time-consuming for the whole data set. Therefore, we consider only a small portion of the data, namely \(10^4\) observations, i.e., a \(0.5\%\) share of all data. In this case, the calculation takes about 40 seconds. We report the results in Table 4.

Table 4 Top 5 pairwise interactions according to GBM \(+\) Friedman H-statistic approach

According to Table 4, the true interactions have the largest H-statistic and are, thus, the strongest ones according to the method of training a GBM model and calculating Friedman’s H-statistic for all possible pairs of variables. However, a different amount of data may lead to a different computation time and may result in a different ranking. For example, the calculation of this interaction-strength measure for the same GBM model but using \(5\%\) of the data (\(10^5\) observations) took about 350 seconds and indicated a few strong but false interactions; e.g., the interactions between variables \(x_1\) and \(x_2\) and between \(x_7\) and \(x_8\) had an H-statistic of 1.

We would like to close this subsubsection with a brief comparison of the two methods. According to [10], Friedman’s H-statistic:

  1. 1.

    Can be applied to any model;

  2. 2.

    Is defined through the partial dependence decomposition and calculates the share of variance that is explained by the interaction;

  3. 3.

    Is usually (but not always) between 0 and 1, which allows for comparison across different models;

  4. 4.

    Detects all forms of interactions, independently of their specific structure;

  5. 5.

    Can be used for quantifying the strength of higher-order interactions, i.e., the interaction among 3 or more features

  6. 6.

    Is computationally time-consuming;

  7. 7.

    May lead to unstable results if not all data points are used, as the estimates also vary from run to run, which is why it is recommended to compute the H-statistic multiple times;

  8. 8.

    Does not provide a clear answer whether the interaction is statistically significant and it is not clear whether H-statistic is large enough to consider an interaction “strong”;

  9. 9.

    Does not give the functional form of the interaction;

  10. 10.

    Has the assumption that features can be shuffled independently, which is, however, violated if features are strongly correlated;

  11. 11.

    May yield unexpected results for small amounts of data.

Our approach of applying the NID method to the NN component of a CANN model:

  1. 1.

    Is model specific and works only for feed-forward NNs with some regularity conditions on activation functions;

  2. 2.

    Is based on the decomposition of the strength of an interaction between input neurons into two parts: the strength of the connections from those input-layer neurons to the neurons in the first hidden layer, and the influence of the neurons in the first hidden layer on the output neuron of the NN;

  3. 3.

    Does not lead to an interaction-strength score that is normalized between 0 and 1, which makes it challenging to compare the scores across different NNs;

  4. 4.

    Detects all forms of interactions learned by the NN, independently of their specific structure;

  5. 5.

    Can be used for quantifying the strength of higher-order interactions, i.e., the interaction among 3 or more features;

  6. 6.

    Is computationally fast, since it requires only cumulative matrix multiplications of the matrices with absolute values of trained weights in the NN;

  7. 7.

    Always leads to the same result, given that the NN is fixed, since the method does not explicitly use data points;

  8. 8.

    Does not provide a clear answer whether the interaction is statistically significant and it is not clear whether the NID score is large enough to consider an interaction “strong”;

  9. 9.

    Does not give the functional form of an interaction;

  10. 10.

    Has the assumption that the interactions learned by the neural network are captured in the first hidden layer.

4.1.3 Step 3: recommendation of the next-best interaction

As described in Subsection 3.3, for each interaction from Table 3 we fit a mini-GLM and keep track of the corresponding KPIs.

The mini-GLM based on the interaction between features \(x_4\) and \(x_5\) has the lowest AIC and the lowest residual deviance among all 5 mini-GLMs. Therefore, it is selected as the next-best interaction to be included in the benchmark GLM.

The addition of the interaction between features \(x_4\) and \(x_5\) to the benchmark GLM improves the performance of the benchmark GLM. Its residual deviance drops from 596992 to 561969 and its AIC decreases from 804445 to 769424, implying that the model with this interaction should be favored. The Poisson deviance on the test data drops from 0.3314 to 0.3134.

After the benchmark GLM has been updated by adding the recommended interaction between \(x_4\) and \(x_5\), we can repeat the whole process. Namely, training a new CANN model that uses the predictions of the updated benchmark GLM and applying the NID method to the NN component of the trained CANN model, we obtain the ranking of learned interactions shown in Table 5. We see that the true interaction between features \(x_5\) and \(x_6\) is ranked as the strongest one; it has a much higher score than the others. For each of the 5 top-ranked interactions, we train a mini-GLM with the simple parametric forms of the interaction described at the end of Sect. 2. As expected, the winning mini-GLM is related to the true interaction between features \(x_5\) and \(x_6\). This model has an AIC of 769406 and a residual deviance of 561960 on 1800737 degrees of freedom. The Poisson deviance on the test set is 0.3134.

If we train mini-GLMs with a larger class of parametric forms for interactions, namely, \(I(x_{\cdot ,j}, x_{\cdot ,k}) = \beta _{j,k} \cdot x_{\cdot ,j}^{a} \cdot x_{\cdot ,k}^b\) for \(a \in \{ 1,2\}\) and \(b \in \{ 1,2\}\), the best-performing mini-GLM corresponds to the interaction of the form \(I(x_{\cdot ,5}, x_{\cdot ,6}) = \beta _{5,6} \cdot x_{\cdot ,5}^2 \cdot x_{\cdot ,6}\). This mini-GLM has an AIC of 763910 and a Poisson deviance of 0.3105 on the test set. Adding this interaction to the benchmark GLM leads to an AIC of 763805 and a Poisson deviance of 0.3104 on the test set.

Table 5 Top 5 pairwise interactions according to the CANN \(+\) NID approach

To show that our approach not only works as desired but is also considerably more time-efficient, we measure the time required for executing the above-described steps of training the CANN model, applying the NID technique, and fitting mini-GLMs. On average, this yields approximately 170.3 seconds for training one CANN architecture, 1.19 seconds for applying NID, and 6.7 seconds for fitting one mini-GLM.

In this case study, we have verified that our methodology leads to the correct recommendation of the next-best interaction for the benchmark GLM and shown that NID is faster than Friedman’s H-statistic. In the next case study, we work with a real-world open-source data set that has more features than the toy example considered before.

4.2 Open-source data set freMTPL2freq

In this subsection, we work with the open-source data set \(\textit{freMTPL2freq}\), which is part of the \(\texttt {R}\) package \(\textit{CASdatasets}\). We choose this data set because it has been analyzed in several papers, e.g., [16, 20, 21, 23]. We take [16] as the main reference and use the benchmark GLM as indicated on page 5 there. Afterwards, we apply our interaction-detection methodology and compare our results with those stated in Section 3.5 of [16].

The data set consists of 678013 observations. Listing 3 provides a glimpse of the data.

[Listing 3: a glimpse of the freMTPL2freq data]

We conduct the data pre-processing as in Section 1.3 of [16] and split the data into training data (\(80\%\)), validation data (\(10\%\)), and test data (\(10\%\)). Next, we train the benchmark GLM, referred to as GLM2 in Section 1.3 of that paper. The resulting benchmark GLM is summarized in Listing 4.

[Listing 4: summary of the fitted benchmark GLM]

4.2.1 Step 1: training CANN

As in the first case study, we conduct the following data pre-processing steps prior to training CANNs (a small R sketch follows the list):

  • Use one-hot encoding for all categorical features with 5 or fewer categories.

  • Use embedding layers for all categorical features with more than 5 categories.

  • Apply min-max scaling to all numerical features.
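
A minimal sketch of these encoding rules in plain R might look as follows; the data frame dat and the integer-index encoding that is later passed to an embedding layer are illustrative assumptions.

  # Illustrative pre-processing sketch (feature encoding only; the network itself is omitted).
  minmax <- function(x) (x - min(x)) / (max(x) - min(x))

  encode_column <- function(x) {
    if (is.numeric(x)) {
      minmax(x)                             # min-max scale numerical features
    } else if (nlevels(factor(x)) <= 5) {
      f <- factor(x)
      model.matrix(~ f - 1)                 # one-hot encode categoricals with <= 5 categories
    } else {
      as.integer(factor(x)) - 1L            # integer indices to be fed into an embedding layer
    }
  }

  encoded <- lapply(dat, encode_column)     # 'dat' is an assumed data frame of raw features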

We focus on CANNs with three hidden layers such that \(q_1 = 20\), \(q_2 = 15\), \(q_3 = 10\), and use the same grid of hyper-parameters as in the case study with artificially generated data.

The best-performing CANN model has the LReLU activation function in all hidden layers. The KPIs of this model on the test data are summarized in Table 6.

Table 6 KPIs on the test data for the best-performing CANN

On the test data set, the best-performing CANN model outperforms the benchmark GLM in terms of all considered KPIs. This is an indication that the NN component that boosts the benchmark GLM may have found some interactions missing in the benchmark GLM.

4.2.2 Step 2: ranking of learned interactions

After training the CANN model, we apply the NID algorithm to calculate the strengths of the pairwise interactions learned by the NN component, as described in Subsection 3.2. As in the case study with the artificial data set, we use the minimum as the surrogate function and the minimum as the aggregation function. Table 7 summarizes the resulting 10 strongest interactions.

Table 7 Top 10 pairwise interactions according to the CANN \(+\) NID approach

According to Table 7, the interaction between the variables VehAge and BonusMalus is much stronger than all other pairwise interactions. The next 4 interactions are of comparable magnitude and do not exhibit a clear “winner” among them.

Next, we relate our results to those of [16] by reporting the interactions those researchers identified and indicating their interaction-strength ranks according to our methodology: (VehPower; VehAge) with NID rank 22, (VehPower; VehBrand) with NID rank 26, (VehAge; VehBrand) with NID rank 7, (VehAge; VehGas) with NID rank 2, and (DrivAge; BonusMalus) with NID rank 9. Interestingly, the interaction between BonusMalus and the regional variables Area or Region was not detected by the methodology proposed in [16], and neither was the interaction between VehAge and BonusMalus.

Finally, we compare our results to the method based on GBMs and Friedman’s H-statistic. We choose the following grid of hyper-parameters to search for the best-performing GBM:

  • Number of trees: 100, 200, 300;

  • Minimal number of observations in a node: 10, 25, 50;

  • Shrinkage parameter: 0.01, 0.05, 0.1,

and train the corresponding 27 GBM models with the benchmark-GLM prediction as an offset. Training one GBM takes on average 80 seconds for the data under consideration. The best-performing GBM in terms of Poisson deviance has 100 trees, a minimum of 50 observations per node, a shrinkage parameter of 0.1, and a bag fraction of 0.5. The KPIs of this model are reported in Table 8. Interestingly, the best-performing GBM model has a better Poisson deviance than the best-performing CANN model, but its lift-plot-based KPIs are worse.

Table 8 KPIs of the best-performing GBM model on the test data
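
For illustration, one point of this hyper-parameter grid could be fitted with the gbm package roughly as follows; the response name ClaimNb, the column glm_pred holding the benchmark-GLM prediction, and the interaction depth are assumptions on our side.

  library(gbm)

  # Sketch of a single grid point; offset(log(glm_pred)) injects the benchmark-GLM
  # prediction as an offset, so the GBM only models the residual structure.
  gbm_fit <- gbm(
    ClaimNb ~ VehPower + VehAge + DrivAge + BonusMalus + VehBrand +
              VehGas + Area + Region + offset(log(glm_pred)),
    distribution      = "poisson",
    data              = train,
    n.trees           = 100,
    n.minobsinnode    = 50,
    shrinkage         = 0.1,
    bag.fraction      = 0.5,
    interaction.depth = 3     # depth > 1 so that trees can represent interactions (assumed)
  )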

When the whole data set is used, the calculation of Friedman’s H-statistic for each pair of variables takes around 5 minutes. We report the corresponding strongest 8 pairwise interactions in Table 9. The H-statistic for each of the remaining pairwise interactions is 0.

Table 9 Top 8 pairwise interactions according to the GBM + Friedman’s H-statistic approach
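
Given such a fitted gbm object, the pairwise H-statistics reported in Table 9 can be obtained with the package’s interact.gbm function by looping over all feature pairs, which is the step that becomes slow on larger data sets; the sketch below assumes the gbm_fit object from the previous sketch.

  # Friedman's H-statistic for every pair of features of the fitted GBM.
  features <- c("VehPower", "VehAge", "DrivAge", "BonusMalus",
                "VehBrand", "VehGas", "Area", "Region")
  pairs <- t(combn(features, 2))

  h_stats <- apply(pairs, 1, function(pr) {
    interact.gbm(gbm_fit, data = train, i.var = pr, n.trees = gbm_fit$n.trees)
  })

  ranking <- data.frame(var1 = pairs[, 1], var2 = pairs[, 2], H = h_stats)
  head(ranking[order(-ranking$H), ], 8)   # strongest pairwise interactions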

We see that the strongest pairwise interaction according to GBM + Friedman’s H-statistic is the second-strongest interaction according to CANN + NID, whereas the strongest pairwise interaction according to our approach is only ranked fifth by GBM + Friedman’s H-statistic. Interestingly, the pairwise interaction between BonusMalus and the regional variables is not captured by the GBM + Friedman’s H-statistic approach.

4.2.3 Step 3: Recommendation of the next-best interaction

As described in Subsection 3.3, for each interaction from Table 7 we fit a mini-GLM and keep track of the KPIs of interest. All mini-GLMs lead to a Poisson deviance of 0.3696 on the test set. The winning mini-GLM achieves the lowest AIC of 279859.2 and corresponds to the interaction between BonusMalus and Region. This interaction is therefore recommended to the actuary for improving the benchmark GLM.

If an actuary prefers another performance measure, another interaction may well be recommended as the next-best one. For example, using the BIC for evaluating the mini-GLMs, our methodology would suggest the interaction between VehAge and VehGas, since the corresponding mini-GLM has the lowest BIC (279941.9).

In contrast to the case study with the artificial data set, we do not know the true functional form of the interaction between variables. Therefore, one may want to explore more sophisticated pairwise interaction terms, as mentioned in Remark 2 in Subsection 3.3. All in all, the determination of the optimal functional form of the next-best interaction is beyond the scope of this paper. The final decision is to be made by the actuaries.

4.3 Brief discussion on proprietary data sets

Data sets of large insurance companies contain millions of observations (policy snippets) with dozens of features. Some of the categorical features, e.g., postcode or vehicle model, have a high number of categories. In such cases, our methodology is especially powerful. Due to the very large number of possible pairwise interactions, comparing all of them by training as many mini-GLMs or by re-fitting the benchmark GLM as many times would come with huge computation-time costs. The alternative of finding the best-performing GBM that uses the benchmark-GLM predictions as an offset and then evaluating the strength of all interactions via Friedman’s H-statistic is also very time-consuming, as we already saw in the previous case studies for smaller open-source data sets. Our approach to interaction detection, in contrast, is essentially instantaneous once the CANN model is trained. Moreover, the embedding layers of the trained CANN model make it possible to efficiently cluster the categories of categorical variables with many categories (e.g., postcodes, car brands) so that they can be included in the benchmark GLM.
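
As a rough illustration of the last point, the learned embedding vectors of a high-cardinality variable can be extracted from the trained CANN and clustered; the layer name "emb_postcode", the use of the keras R interface, and the choice of 10 clusters below are purely hypothetical.

  library(keras)

  # Hypothetical sketch: group postcode categories via their learned embedding vectors.
  emb <- get_weights(get_layer(cann_model, "emb_postcode"))[[1]]   # one row per category

  set.seed(1)
  cl <- kmeans(emb, centers = 10)   # cluster similar postcodes into 10 groups

  # cl$cluster maps each postcode category to a cluster label, which can serve as a
  # much coarser categorical feature inside the benchmark GLM.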

5 Conclusion

In this paper, we propose an approach to detect the next-best interaction to be added to an arbitrary but fixed benchmark GLM in the context of claim-count modeling. First, a CANN model is trained, which can be seen as boosting the benchmark GLM by a neural network. Second, a fast model-specific method called Neural Interaction Detection is applied to quantify the strength of the interaction between each pair of features and to rank the interactions by their strength. Third, the next-best interaction is identified by comparing a small number of mini-GLMs that correspond to the top-ranked interactions. In the case studies, we validated the approach on two different data sets and discussed its usage for large proprietary data sets.

Our methodology has two advantages. First, it is a fully automatable way of enhancing a benchmark GLM by including the next-best interaction missing from it. Second, it is faster than alternative approaches based on Friedman’s H-statistic. Therefore, our methodology is especially suitable for big data sets with dozens of features and millions of observations. As a result, it can substantially reduce the time that pricing actuaries spend on the largely manual and visual search for interactions to improve their GLMs.

The interaction-detection procedure we have introduced has several degrees of freedom, e.g., the encoding of features, hyper-parameters of the NN, the KPIs for selecting the best-performing CANN, and those for comparing mini-GLMs. Therefore, it would be interesting to analyze how sensitive our approach is to different choices for each degree of freedom.