1 Introduction

Modeling interactions between categorical predictors is standard practice in many empirical applications using linear models. For example, in randomized controlled trials it is common to include interactions between a treatment and a set of covariates to search for treatment effect heterogeneity [9, 14, 16]. Other types of studies, on education, health, or labor market outcomes, also commonly include interactions between socioeconomic status and characteristics like race and ethnicity [4, 8, 10, 15]. A common approach to modeling categorical predictors and their interactions in linear models is to encode each category and each combination of categories using a single binary variable, commonly referred to as one-hot dummy encoding [2]. However, learning from a model with interactions becomes challenging when there are many categorical predictors and/or categories [12].

The simplest case of an interaction between categorical predictors is the one given by two binary predictors. As an illustration, consider the real-world German credit dataset used in our numerical section. The aim is to perform a supervised classification task, where we try to classify people according to a set of predictors as good or bad in terms of creditworthiness. This dataset contains 967 records. Consider two of its binary predictors, namely, Telephone (in clients name) and Foreign worker. The interaction between two binary predictors can be modeled by adding a new binary predictor which is the combination of both characteristics; in our example, that would mean individuals with Telephone (in clients name) \(=1\) and Foreign worker \(=1\). Clearly, it is easy to interpret the role of both binary predictors and their interaction, as this only involves looking at three coefficients. In our example, we would have one coefficient for the effect of Telephone (in clients name) \(=1\) compared to Telephone (in clients name) \(=0\), another one for Foreign worker \(=1\) compared to Foreign worker \(=0\), and the last one for the interaction, i.e., for Telephone (in clients name) \(=1\) and Foreign worker \(=1\). This motivates the goal of this paper: to binarize the categorical predictors.

Continuing with the example above, consider now the case in which we have two categorical predictors, such as Job (with 4 categories) and Purpose (with 11 categories). To model the interactions between two categorical predictors, we need a coefficient for each possible combination of a category from the first predictor and another from the second one. Clearly, when interpreting these two categorical predictors and their interaction, we require (many) more coefficients. In our example we need to estimate 3 coefficients associated with the categories of predictor Job, 10 for the categories of predictor Purpose and \(3\cdot 10=30\) for the interaction terms. This means that we need to estimate a total of 43 coefficients to interpret the role of both categorical predictors and their interaction. Needless to say, the number of parameters to be estimated is even higher if we have more than 2 categorical predictors in the dataset. In our example, if we consider the pairwise interactions between all 13 categorical predictors in the German dataset, we would have to estimate 379 coefficients, after the deletion of the interactions for which we have no data. This makes the estimation of some coefficients imprecise and adds noise to the regression since we have too few records (967) with respect to the high number of parameters to be estimated. Our methodology aims at dramatically reducing this complexity.
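The coefficient counting above is easy to reproduce. The authors' implementation is in R; the following is an illustrative Python sketch of the arithmetic, using one-hot coding with one reference category per predictor, before removing combinations with no data:

```python
def n_coefficients(category_counts):
    """Coefficients for all main effects plus all pairwise interactions,
    with one reference category dropped per predictor."""
    mains = sum(k - 1 for k in category_counts)
    pairs = sum((ki - 1) * (kj - 1)
                for i, ki in enumerate(category_counts)
                for kj in category_counts[i + 1:])
    return mains + pairs

# Job (4 categories) and Purpose (11 categories): 3 + 10 + 3*10 = 43
print(n_coefficients([4, 11]))  # -> 43
```

For two binary predictors the same formula gives 1 + 1 + 1 = 3 coefficients, matching the Telephone/Foreign worker example above.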

In this paper, we propose to find a reduced representation of the categorical predictors as binary predictors to tackle the burden of having too many coefficients to estimate with possibly too few records. As an illustration, let us take the categorical predictor Job which includes categories Unemployed/unskilled - non-resident, Unskilled - resident, Skilled employee/official, Management/self-employed/highly qualified employee/officer. If some of these categories have a similar impact on the response variable, we could group them together. Say, for instance, Unemployed/unskilled - non-resident and Unskilled - resident are in one group and Skilled employee/official and Management/self-employed/highly qualified employee/officer in another group. Thus, instead of 4 binary variables associated with Job, we would have just one, indicating whether the individual shows any category of the first group. Similarly, instead of 11 binary variables for Purpose, after splitting the categories into two groups, we would have just one. Then, the interaction between Job and Purpose would be represented by just one coefficient. By doing so, and after the deletion of interactions for which we have no data, our approach reduces from 379 to 34 the number of coefficients associated with all categorical predictors and their interactions in the German dataset.

In this paper, we propose a novel methodology to binarize the categorical predictors in Generalized Linear Models (GLM) to model interactions. The goal is to split the categories associated with each categorical predictor into two groups, such that categories in the same group have a similar impact on the response variable. Thus, we make categories in the same group share the same coefficient in the GLM, with the hope that accuracy is not affected much while reducing the number of coefficients. We provide a collection of binarized representations for each categorical predictor, where the dissimilarity takes into account information from the main effects and the interactions. The choice of the binarized predictors representing the categorical predictors is made with a heuristic procedure that is guided by the accuracy of the so-called binarized model.

Our approach to binarizing the categorical predictors to model interactions offers several advantages. First, assuming that the samples of records associated with categories are homogeneous enough, by binarizing the categories we avoid having an over-parametrized model with a coefficient to be estimated per category. Second, we have just one coefficient for each categorical predictor and another one for each interaction between two categorical predictors. This is a step towards enhancing the interpretability of the Generalized Linear Model with interactions. Third, our methodology searches for groups of categories that have a similar impact on the response. This is in contrast to shrinkage methods like the version of group lasso proposed by [3, 12], where the goal is just to select relevant predictors and interactions. Fourth, since we are grouping together similar categories, with our approach we have more records to estimate each coefficient, which together with the homogeneity ensures lower standard errors, as pointed out by, e.g., [11] and [6].

The rest of the paper is organized as follows. Section 2 introduces the algorithm to binarize the categorical predictors using information from the main effects and the interactions. Section 3 illustrates the performance of our methodology on real-world and simulated data, compared to lasso and group lasso. Finally, conclusions and future research are collected in Section 4.

2 Methodology

In this section, we detail the methodology to find a reduced representation of categorical predictors as binary predictors. First, we introduce the notation for the Generalized Linear Model (GLM) with categorical predictors and their interactions. We then introduce a dissimilarity measure between categories of the same predictor based on the GLM coefficients. With this dissimilarity, we define an iterative algorithm, where in each iteration we cluster the categories of a predictor into two groups to achieve a reduced representation as a binary variable. The binarized predictors will be used to train the so-called binarized GLM in which each categorical predictor is modeled using its reduced representation.

Let us first describe the required notation. We have J categorical predictors. Predictor j has \(K_j\) categories, which, when needed, will be denoted with letters of the alphabet. In the GLM using the traditional one-hot encoding, a categorical predictor j with \(K_j\) categories is represented by \(K_j-1\) binary variables, one for each category except a reference category, which is left out for contrast. Therefore, for each categorical predictor, we will leave out one of its categories. We follow the notation in [12]. Consider a GLM where the outcome Y is related to X, comprising the predictors and their interactions, through a link function G:

$$\begin{aligned} {\mathbb {E} \, [Y | X]}=G\Big (\alpha + \sum _{j=1}^{J} X_j \cdot \beta _j + \sum _{1\le j<l\le J} X_{j:l}\cdot \Theta _{j:l} \Big ), \end{aligned}$$
(1)

where \(\alpha \) is the intercept, \(X_j\) is the vector of binary variables associated with the \(K_j-1\) categories of categorical predictor j, with corresponding parameter vector \(\beta _j\). The term \({X_{j:l}}\) is the interaction between categorical predictors j and l, with the corresponding vector of model parameters \(\Theta _{j:l}\), where \({X_{j:l}}\) is the Kronecker product between \({X_j}\) and \({X_l}\). For example, for \(K_j=3\) and \(K_l=4\), we have

$$\begin{aligned} X_{j:l}&= \begin{pmatrix} X_{jb}&X_{jc} \end{pmatrix} \otimes \begin{pmatrix} X_{lb}&X_{lc}&X_{ld} \end{pmatrix}\\ &= \begin{pmatrix} X_{jb:lb}&X_{jb:lc}&X_{jb:ld}&X_{jc:lb}&X_{jc:lc}&X_{jc:ld} \end{pmatrix}, \end{aligned}$$

where \(X_{jb:lb}\) is the interaction between category b of predictor j and category b of predictor l, and \(\Theta _{jb:lb}\) is its corresponding coefficient. The rest of the terms can be defined in a similar fashion.
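For a single record, the interaction block is literally the Kronecker product of the two dummy vectors. A small NumPy sketch of the \(K_j=3\), \(K_l=4\) example above (the record shown, with predictor j in category c and predictor l in category b, is hypothetical):

```python
import numpy as np

# Dummy encodings with the reference category (a) dropped:
X_j = np.array([0, 1])       # (X_jb, X_jc): this record has category c
X_l = np.array([1, 0, 0])    # (X_lb, X_lc, X_ld): this record has category b

# The Kronecker product gives the 6 interaction dummies in the stated order:
X_jl = np.kron(X_j, X_l)
print(X_jl)  # [0 0 0 1 0 0] -> only X_{jc:lb} is active
```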

Fig. 1 Binarization steps for categorical predictor Job from the German dataset, when interaction effects with categorical predictor Housing are considered

Fig. 2 Pseudocode for the binarization algorithm of categorical predictors to model interactions

Table 1 Description of the datasets used to test the binarization algorithm
Table 2 German dataset: description of the categorical predictors

A few remarks about the GLM in (1) are worth noting. First, for a binary response variable \(Y \in \{0,1\}\), a natural choice of link function G is the logit, which we use in Section 3. The approach below can deal with other types of response variables, such as count data, as well as other link functions, such as the log link in Poisson regression. Second, among the J categorical predictors we may have binary ones. These are already represented in the most compact form and therefore do not need to be binarized. Third, our methodology can also handle data containing continuous predictors, as in Section 3, but for the sake of notational simplicity, we have decided not to include them in (1).

We will now explain how the binarization of a given categorical predictor, say s, is performed. If s is an ordinal categorical predictor, we apply the approach in [5]. In this case, there is a natural order in the categories of s, which is used to define the so-called feasible clusterings of the categories. For a given threshold value \(\tau \), a feasible clustering is one in which the first \(\tau \) categories of s compose the first cluster, and the remaining ones the second cluster. By changing \(\tau \) appropriately, we obtain all possible feasible clusterings of the categories and the corresponding binarized representation of the ordinal variable s. Therefore, these ordinal predictors are not included in the discussion below.

In case s is a non-ordinal categorical predictor, we have a more complex relationship between the response variable and the predictors, with both marginal as well as interaction effects, and therefore the approach in [5] is not applicable. Thus, we will inspect the marginal effects and the interactions in (1) to build a dissimilarity matrix which can then be used in a clustering procedure to find two clusters for categorical predictor s.

Let us explain how we calculate the dissimilarity between the pair of categories b and c of predictor s. Category b is similar to category c if they affect the response variable in a similar way. We calculate this by estimating the GLM in (1) and comparing the marginal coefficients for b and c, as well as the coefficients associated with the interactions for these categories. Given the challenges of training a GLM with all possible interactions, where we would, in general, have an overparametrized model, we consider the interaction of s with one other categorical predictor j at a time. As we will see below, we iterate over all candidates j, obtaining a different dissimilarity matrix for s for each choice, and thus a different binarization of s.

We are now ready to define the dissimilarity between the categories b and c of predictor s, when modeling the interaction between s and j:

$$\begin{aligned} \delta ^{(j)}_s(b,c)=(1-\lambda )\delta _s^{mar}(b,c) + \lambda \delta _s^{int}(b,c) , \end{aligned}$$
(2)

where \(\delta _s^{mar}(b,c)=|\beta _{b} - \beta _{c}|\) is the difference, in absolute terms, between the pair of marginal coefficients for b and c, \(\delta _s^{int}(b,c)\) is the \(\ell _1\) distance between the two interaction coefficient vectors, and \(\lambda \in [0,1]\). We place more weight on the information provided by the interaction coefficients the higher the value of \(\lambda \). Note that even when \(\lambda =0\), in which case the interaction coefficients do not play a role in (2), the dissimilarity still contains information from the interactions through the marginal coefficients, since they have been estimated from a model including these interactions.
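Computing (2) from fitted coefficients is straightforward. A minimal Python sketch (the authors' implementation is in R); the coefficient values in the example call are hypothetical, purely for illustration:

```python
import numpy as np

def dissimilarity(beta_b, beta_c, theta_b, theta_c, lam=0.5):
    """delta_s^(j)(b, c) in (2): convex combination of the marginal gap
    and the l1 distance between the interaction coefficient vectors."""
    d_mar = abs(beta_b - beta_c)
    d_int = np.abs(np.asarray(theta_b) - np.asarray(theta_c)).sum()
    return (1 - lam) * d_mar + lam * d_int

# Hypothetical marginal coefficients for categories b and c of s, and
# interaction coefficient vectors against the categories of some predictor j:
print(dissimilarity(0.8, 0.3, [0.1, -0.2], [0.4, 0.0]))
```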

Let \(\varvec{\delta }^{(j)}_s\) denote the dissimilarity matrix, which contains the dissimilarities between all possible pairs of categories of predictor s, when modeling the interaction between s and j. With \(\varvec{\delta }^{(j)}_s\), and using a clustering procedure, we can cluster the categories of s into two groups, such that categories in the same group affect the response variable in a similar way. These two groups yield a reduced representation of predictor s as a binary variable, where all categories in the same group now affect the response variable in the same way.

In Fig. 1 we illustrate this process when s is the categorical predictor Job from the German dataset, see Table 2 for the full list of predictors. For the sake of clarity, we have shortened the names of the categories of Job to their first word. We estimate the coefficients in the GLM in (1) with all marginal effects and the interactions between the categories of Job and the ones from another predictor, namely Housing, with three categories, namely Rent, Own and For free. The coefficients can be found in Fig. 1a. Note that Own has been chosen as the reference category for Housing, having thus a coefficient equal to zero, which explains the column of zeroes in the table in Fig. 1a. The same holds for the category Skilled of Job, which explains the row of zeroes.

Then, we calculate the dissimilarity matrix \(\varvec{\delta }_{Job}^{(Housing)}\) using (2) with \(\lambda =0.5\), see Fig. 1b. We apply a hierarchical clustering procedure with the resulting clusters shown in Fig. 1c. With this, we find a reduced representation of predictor Job as a binary variable that takes on value 1 if Job is equal to Unemployed or Unskilled and 0 otherwise.
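This step can be sketched with SciPy's hierarchical clustering. The paper does not specify the linkage used, so average linkage is assumed here, and the dissimilarity values below are invented for illustration (they are not the ones in Fig. 1b):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Illustrative dissimilarity matrix over the shortened Job category names:
names = ["Unemployed", "Unskilled", "Skilled", "Management"]
D = np.array([[0.0, 0.2, 1.1, 1.3],
              [0.2, 0.0, 1.0, 1.2],
              [1.1, 1.0, 0.0, 0.3],
              [1.3, 1.2, 0.3, 0.0]])

# Hierarchical clustering on the condensed matrix, cut into two clusters:
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

# With these values, {Unemployed, Unskilled} and {Skilled, Management}
# form the two groups, yielding the binary representation of Job.
print(dict(zip(names, labels)))
```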

Our goal is to try out different binarizations of predictor s in order to find a good one in terms of accuracy. The dissimilarity matrix \(\varvec{\delta }^{(j)}_s\) depends on which interactions are incorporated in (1). In our example above, if instead of interacting Job with Housing, we interact it, for instance, with Status of existing checking account, we would obtain a different dissimilarity matrix. By doing this for all predictors \(j \not = \textit{Job}\), we would have \(J-1\) dissimilarity matrices \(\varvec{\delta }_{Job}^{(j)}\), one per candidate predictor j. Then, we would have \(J-1\) different binarizations for the same categorical predictor that we could choose from, based on out-of-sample accuracy. After making the choice and binarizing the predictor using the corresponding clustering, we incorporate this reduced representation in the next decision to make, namely, the binarization of another categorical predictor.

Fig. 3 Simulated dataset: coefficients in the data generating model

Table 3 Real-world datasets: accuracy and relative complexity in the validation set, for the LR without interactions, binarized LR, lasso and group lasso models

The pseudocode of our algorithm to binarize categorical predictors to model interactions can be found in Fig. 2. In lines 1 to 3, we initialize the parameters of the algorithm. In lines 7 to 17, we randomly choose the next predictor s to binarize and estimate the coefficients of the GLM in (1), which includes all marginal effects and, one at a time, the interactions between the categories of s and those of another categorical predictor. Then, we calculate the dissimilarity matrices and apply a clustering procedure to find different binarizations of the categorical predictor. In line 18, we estimate the coefficients in the GLM, in a similar fashion as before, but now with s binarized. The binarization of s that gives the highest out-of-sample accuracy is chosen, and the categorical predictor is considered binarized in this way for the steps to come. Once all predictors are binarized, we train, in line 21, the binarized GLM, \(GLM_i^B\), including all binary predictors, and evaluate its performance in a validation set. Since the order in which we binarize predictors matters, we repeat the process m times and finally choose the \(GLM_i^B\) that gives the highest out-of-sample accuracy.

3 Numerical illustrations

In this section, we illustrate the performance of our binarization methodology for categorical predictors to model interactions. We focus on supervised classification and use as a baseline the logistic regression (LR), where G in (1) is the logit function. The binarized LR, obtained with the algorithm in Fig. 2, is compared against LR, lasso and group lasso. These four models are trained with interactions between the categorical predictors. For completeness, we also include the LR in which only marginal effects are modeled, and refer to it as LR without interactions. As performance criteria, for each model we report its classification accuracy and its relative complexity, which is defined as the number of estimated coefficients for the categorical predictors and their interactions relative to the number of estimated ones for LR. With this, the lowest value of the relative complexity is equal to 0, when the categorical predictors do not play a role in the model.
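Relative complexity is a simple ratio of coefficient counts. As a sketch, using the German-dataset counts quoted in the introduction (34 coefficients for the binarized model versus 379 for the one-hot LR with interactions, before any further variable selection):

```python
def relative_complexity(n_coef_model, n_coef_full_LR):
    """Estimated coefficients for the categorical predictors and their
    interactions, relative to the one-hot LR with interactions (in %)."""
    return 100.0 * n_coef_model / n_coef_full_LR

# German dataset, counts from the introduction (before stepwise selection):
print(round(relative_complexity(34, 379), 2))  # -> 8.97
```

Note that this value is computed before the stepwise simplification applied in the experiments, so it is larger than the final figure reported for the binarized LR in Table 3.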

Our message is twofold. First, we will illustrate that LR has a poor classification accuracy performance, since the number of coefficients to estimate, \(\sum _{j=1}^J (K_j-1) + \sum _{1\le j<s \le J} (K_j-1) \cdot (K_s-1)\), can be very large compared to the number of records available, while for some of the combinations of categories from j and s there may not be enough records. Second, we will show that the accuracy of the binarized LR is comparable to that of the benchmarks, while the binarized LR outperforms them in terms of relative complexity, which is our measure of interpretability.

Fig. 4 Real-world datasets: accuracy and relative complexity in the validation set, for the LR without interactions, binarized LR, lasso and group lasso models

Fig. 5 German dataset: binarization of the categorical predictors. Note that \(X_{12}\) and \(X_{13}\) are already binary, i.e., \(B_{12} = X_{12}\) and \(B_{13} = X_{13}\), and therefore have not been included here

The algorithm in Fig. 2 has two parameters, namely the number of iterations m and the weight \(\lambda \). We chose \(m=200\) but, by looking at the output of all iterations, one could see that in these datasets a smaller number would have yielded almost the same results. As for \(\lambda \), after performing a sensitivity analysis, we decided to set it to 0.5; in our benchmark datasets, other choices returned similar values. We perform ten-fold cross-validation to select the final binarization of the categorical predictors. The output of this algorithm is further simplified using a stepwise selection routine in which we select the relevant marginal and interaction effects. To make the comparison fair, we apply this selection, guided by the Akaike Information Criterion (AIC), to both LR and binarized LR. For lasso and group lasso, we perform ten-fold cross-validation to select the shrinkage parameter. For group lasso, we implement the version in [12] that considers interactions, where the categories associated with each categorical predictor belong to the same group. We coded our algorithm in R and conducted the experiments on a workstation with an Intel® Core™ i5-4460 processor and 8 GB of RAM.

The rest of this section is organized as follows. Section 3.1 describes the datasets, Section 3.2 is devoted to the analysis of the real-world datasets, and Section 3.3 to the simulated dataset.

Fig. 6 German dataset: coefficients for the binarized model with interactions and their significance, where * indicates a p-value below 0.1, ** below 0.05, and *** below 0.01, before and after the stepwise variable selection procedure has been applied

3.1 Datasets

Our methodology is illustrated on four real-world datasets available at the UCI Machine Learning Repository [7] and one simulated dataset, see Table 1. In the first two columns, we report the name of the dataset and the total number of records (N). In the remaining columns, we report the response distribution, i.e., the percentage of observations with response \(Y=0\) and the percentage with \(Y=1\), the number of categorical predictors (J), which includes binary ones too, the number of continuous predictors (P), the total number of categories (\(\sum _{j=1}^J K_j\)), and the number of categories for each categorical predictor (\(K_j\)).

To illustrate the resulting binarized LR, we show the resulting coefficients and p-values for one of our real-world datasets, namely, the German dataset. In the German dataset, we try to classify people according to a set of predictors as good or bad in terms of creditworthiness. We have 967 records with 13 categorical predictors. Table 2 shows a summary of the categorical predictors in the dataset including the full name of the predictor, the number of categories, the four categories with the top counts, and whether the predictor is ordinal or not. Predictors \(X_{12}\) and \(X_{13}\) are already binary. Therefore, the first 11 categorical predictors are the ones that need to be binarized. The three ordinal predictors have been binarized using the methodology in [5]. The eight remaining ones are binarized using the algorithm in Fig. 2.

Now let us explain how we have designed the simulated experiment. We want to have clear groups of coefficients within each categorical predictor. The existence of clear groups would lead to an over-parametrized logistic regression if estimated using the one-hot dummy encoding. We generate 12,000 records of 4 categorical predictors, drawn from a multinomial distribution with equal probabilities for each category, and 2 continuous predictors from a normal distribution with mean 0 and standard deviation 1. Our generating model has only one interaction effect, namely, between the first two categorical predictors. The response \(Y \in \{0,1\}\) is generated from the binomial distribution with probabilities obtained by applying the logistic regression model, using the coefficients in Fig. 3. The groups of categories are apparent from these coefficients. For example, categories b and c of predictor \(X_1\) share the same coefficient value, \(\beta _{1b}=\beta _{1c}=2\), while for a and d we have \(\beta _{1a}=\beta _{1d}=0\). In summary, there is an equivalent generating model where the four categorical predictors are binary, namely, \(B_1\) with coefficient equal to 2, \(B_2\) with 2, \(B_3\) with \(-1\) and \(B_4\) with 2.5, and one relevant interaction, namely, \(B_{1:2}\) with coefficient \(-6\). In Section 3.3, we will show that our algorithm is able to recover this equivalent binary generating model.
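The data-generating process can be sketched in Python as follows. Which categories fall on the "1" side of each binary \(B_j\) is illustrative here, except for \(X_1\), where the text fixes b and c in one group; the two continuous predictors are drawn but, since their coefficients are not restated in the text, they are left out of the linear predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12_000

# Four categorical predictors with 4 equiprobable categories (coded 0..3
# for a..d) and two standard-normal continuous predictors:
cats = rng.integers(0, 4, size=(n, 4))
cont = rng.normal(size=(n, 2))

# Equivalent binary generating model from the text: coefficients 2, 2, -1,
# 2.5 for B1..B4 and -6 for the single interaction B1:B2.
B = np.column_stack([np.isin(cats[:, j], [1, 2]) for j in range(4)]).astype(float)
eta = 2 * B[:, 0] + 2 * B[:, 1] - 1 * B[:, 2] + 2.5 * B[:, 3] - 6 * B[:, 0] * B[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
print(y.shape, y.mean())
```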

3.2 Real-world datasets

In this section, we illustrate the performance of our binarization algorithm in four real-world datasets in terms of accuracy and relative complexity. These estimates are obtained as follows: the dataset is split into a training sample (70%), a test sample (\(15\%\)), and a validation sample (\(15\%\)). The model is built in the training sample, we choose the binarization of the categorical predictors using the out-of-sample performance in the test sample and we report its final accuracy in the validation sample. The process is repeated ten times and we report as an estimate the average out-of-sample accuracy. A similar process is used for the benchmarks, LR without interactions, LR, lasso and group lasso.
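The splitting scheme can be sketched as follows; `split_indices` is a hypothetical helper for illustration, not the authors' R code:

```python
import numpy as np

def split_indices(n, rng):
    """70/15/15 train/test/validation split, as used in the experiments."""
    idx = rng.permutation(n)
    n_tr = int(0.70 * n)
    n_te = int(0.15 * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_te], idx[n_tr + n_te:]

rng = np.random.default_rng(0)
train, test, val = split_indices(967, rng)  # German dataset size
print(len(train), len(test), len(val))  # -> 676 145 146
```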

The accuracy and the relative complexity can be found in Table 3 and in Fig. 4, both measured as a percentage. We can see that, for all datasets, LR gives a lower accuracy than the binarized LR. This is because in the real-world datasets the number of records associated with each category is not evenly distributed, and hence some categories have few observations, which leads to even fewer observations for the interactions. In some cases, this is exacerbated by the small absolute number of observations, as in the German dataset, where the ratio of the number of coefficients associated with the categorical predictors (after deleting those for which we have no records) to the number of records, 379/967, makes training this model very challenging. In this dataset, the accuracy goes from \(62.19\%\) for LR to \(76.44\%\) for the binarized LR. This outperformance of the binarized LR can be seen in the other three real-world datasets too.

Table 4 Simulated dataset: accuracy and relative complexity in the validation set, for the LR without interactions, binarized LR, lasso and group lasso models
Fig. 7 Simulated dataset: generating model (left) and coefficients of original model with 95% confidence intervals (right)

Fig. 8 Simulated dataset: equivalent binary generating model (left) and coefficients of binarized model with 95% confidence intervals (right)

The relative complexity of the binarized LR is competitive not only against LR but also when compared with the LR without interactions, which has far fewer coefficients to be estimated. For the German dataset, the relative complexity of the binarized LR is 3.64%, compared to 4.68% for the LR without interactions. This outperformance is even more pronounced in the Coil2000 and Adult datasets, where the binarized LR halves the relative complexity of the LR without interactions. In conclusion, we are able to model the interactions while working with a much smaller model.

Comparing the binarized LR to lasso and group lasso, we find that our algorithm produces a model with higher accuracy for the datasets Adult and Bank marketing. For Coil2000 and German, our method performs similarly to lasso and group lasso in terms of accuracy. In terms of relative complexity, in three out of four datasets, our method results in a smaller model.

For the binarized LR model in the German dataset, Fig. 5 reports for each categorical predictor the two clusters of categories yielding the reduced representation as a binary predictor. For instance, let us look at the first and the last categorical predictors to be binarized, namely \(X_1\) defined as Status of existing checking account (with 4 categories) and \(X_{11}\) defined as Job (with 4 categories). For \(X_1\), binarized as \(B_{1}\), cluster 1 contains one category (\(... < 0 DM\)) and cluster 2 the remaining three (\({... < 200 DM, ... >= 200}\) DM/salary assignments at least 1 year, no checking account). For \(X_{11}\), binarized as \(B_{11}\), cluster 1 contains three categories (unemployed/unskilled - non-resident, unskilled - resident, skilled employee/official) and cluster 2 the remaining category (management/self-employed/highly qualified employee/officer). As pointed out in the introduction, it is now easier to interpret the role of Status of existing checking account and Job and their interaction, as this only involves looking at three coefficients, the ones of \(B_{1}\), \(B_{11}\), and \(B_{1:11}\). Figure 6 helps us visualize these coefficients.

Figure 6 provides information about the coefficients of the binarized LR model before the stepwise variable selection procedure has been applied (Fig. 6a) and after (Fig. 6b). In the diagonal of each matrix we find the marginal coefficients for the binary predictors and outside the diagonal the coefficients for their interactions when both binary predictors are set to one. Looking at Fig. 6b, we can see that there are 12 marginal coefficients, 2 are significant at the 1% level, and 4 more at the 5%. As for the interactions, there are 9 coefficients after a stepwise selection has been performed. From those, 3 are significant at the 1% level, 1 more at the 5%, and 2 additional ones at the 10%. At the 1% level, Savings account/bonds (\(B_{4}\)) and Other debtors / guarantors (\(B_{7}\)) are significant, as well as the interactions between Status of existing checking account and Job (\(B_{1:11}\)), Credit history and Other installment plans (\(B_{2:9}\)), and Present employment since and Property (\(B_{5:8}\)).

3.3 Simulated dataset

In this section, we discuss the results for the simulated dataset. As before, we split the data into a training sample (70%), a test sample (15%), and a validation sample (15%). The model is built in the training sample, the parameters are chosen using the test sample, and the performance is measured in the validation sample. We repeat the process ten times, and report the average out-of-sample accuracy and relative complexity.

Table 4 reports the accuracy and the relative complexity. The conclusions are similar to those for the real-world datasets. While all models have a similar accuracy, the binarized LR outperforms the benchmarks in terms of relative complexity. This means that our approach allows us to model the interactions using a much smaller model. We end this section by illustrating how our binarization algorithm is able to recover the underlying generating model. On the right panel of Fig. 7, we plot the value of the coefficients and their 95% confidence intervals for the original model with interactions, while the left panel plots the values used by the data generating model. We plot similar information in Fig. 8 for the binarized LR. We see that we recover the generating model in both cases, while in the binarized LR the coefficients are estimated with a larger sample, resulting in smaller standard errors, as seen in the 95% confidence intervals around the coefficients.

4 Conclusions

In this paper, we have presented an approach to binarizing categorical predictors that enables working with interactions in Generalized Linear Models. Our approach offers several advantages. First, provided that the samples of categories are homogeneous enough, by binarizing we avoid having an over-parametrized model with a coefficient for each category. Second, we estimate just one coefficient for each categorical predictor and another one for each interaction. This gives a more interpretable model compared to having all the categories as binary variables. Third, by binarizing the categories we have more records to estimate each coefficient, which together with the homogeneity ensures lower standard errors. In the numerical section, we have used a simulated dataset and four real-world ones from supervised classification. In all these cases, our algorithm, in which the GLM with the logit link was used, considerably reduces the number of coefficients of the model, allowing the user to interpret and select interactions between the new binarized categorical predictors. We end by noting that, although for simplicity the methodology has been tested on supervised classification with logistic regression as the baseline, the very same approach is applicable to other classification and regression tasks, as long as they are based on a GLM.

A fruitful line of future work is related to the use of categorical predictors that contain sensitive information. In the future, our clustering methodology could take into account not only the overall accuracy but also a fair treatment of the sensitive groups [1, 13, 17]. Another interesting line of future research is the pursuit of metaheuristics that can deal with large-scale datasets involving an extremely large number of categories.