1 Introduction

Data science comprises a plethora of methods and strategies to extract useful knowledge from raw data. Among other disciplines, such as statistics, computer science or information technology, mathematical optimization plays a crucial role across its tasks (Bottou et al. 2018; Gambella et al. 2021; Olafsson et al. 2008). Much effort has gone into incorporating recent developments in optimization theory and software to tackle data science problems more effectively, such as in regression and classification (Bertsimas and King 2016; Bertsimas and Shioda 2007; Blanquero et al. 2020; Carrizosa and Romero Morales 2013; Carrizosa et al. 2021; Toriello and Vielma 2012), clustering strategies (Benati and García 2014; Carrizosa et al. 2013; Hansen and Jaumard 1997; Hochbaum and Liu 2018; Park et al. 2000; Sağlam et al. 2006), correspondence analysis (van de Velden et al. 2020), dimensionality reduction methods (Carrizosa and Guerrero 2014; Carrizosa et al. 2020; Cunningham and Ghahramani 2015), deep learning (Anderson et al. 2020; Fischetti and Jo 2018) or data visualization (Carrizosa et al. 2017a, 2018a, b, 2019).

There are still many data science problems that do not take advantage of such advances and for which ad-hoc strategies are used instead, such as the analysis of the independence of two categorical variables through a function of the entries of their contingency table. Let U and V be two categorical variables, which take on a finite number of values, \(u_1,\ldots ,u_r\) and \(v_1,\ldots ,v_c\), respectively. Given a set of n entities for which these variables have been observed, a first summary of their distribution is provided by their contingency table, in which the frequency of the event \((u_i,v_j)\), \(o_{ij}\), is collected for \(i=1,\ldots ,r\) and \(j=1,\ldots ,c\). Table 1 contains an example of a contingency table in which \(n=98\) observations (bottom right corner) are cross-classified according to variable U,  which has categories \(u_1\) and \(u_2\), and variable V,  which has three categories (\(v_1,v_2\) and \(v_3\)). Whereas the inner rectangle in the table contains the joint frequencies \(o_{ij}\), the last row (resp. column) contains the marginal frequencies \(o_{.j}\) of V, \(j=1,\ldots ,c\) (resp. \(o_{i.}\) of U, \(i=1,\ldots ,r\)).

Table 1 Example of a contingency table

When the data is cross-classified as in Table 1, the statistical (in)dependence of two categorical variables is usually investigated using the classical \(\chi ^2\) measure (Pearson 1900; Mirkin 2001), although different approaches exist in the literature (Goodman and Kruskal 1979; Joe 1989). The \(\chi ^2\) coefficient is an estimate of the deviation between the empirical probability distribution of the variables U and V and the probability distribution that we would have if the two variables were statistically independent, and it is given by

$$\begin{aligned} \chi ^2 = \displaystyle \sum _{i=1}^{r} \sum _{j=1}^{c} \frac{(o_{ij}-e_{ij})^2}{e_{ij}}, \end{aligned}$$
(1)

where \(e_{ij} = \displaystyle \frac{o_{i.}o_{.j}}{n}\). The minimum value of \(\chi ^2\) in (1) is 0, which occurs if and only if the variables U and V are statistically independent. Therefore, the larger \(\chi ^2\), the stronger the evidence against independence. The \(\chi ^2\) statistic in (1) asymptotically follows a Chi-squared distribution with \((r-1)\times (c-1)\) degrees of freedom. In Table 1, one gets \(\chi ^2 = 6.355\), which provides evidence against the statistical independence of the two variables involved (\(p\)-value\(=0.04,\) which tests the null hypothesis of statistical independence) at a \(5\%\) significance level. These inferential properties hold as long as the observed joint frequencies \(o_{ij}\) are large enough for all \(i=1,\ldots , r\) and \(j=1,\ldots ,c\). To ensure this, the categories of the variables, associated with the rows and columns of the table, are often grouped, yielding a less granular representation of the categorical variables. Clustering categories in rows and/or columns of a contingency table is also desirable to enhance interpretability and transparency (Baesens et al. 2003; Carrizosa et al. 2017b, 2022; Goodman and Flaxman 2017; Ustun and Rudin 2016), by easing the presentation of the table as well as the conclusions of the analysis from a statistical perspective. Furthermore, constrained clustering allows the analyst to incorporate knowledge about the problem under study and support meaningful decision making (Abin 2019; Śmieja and Wiercioch 2017).
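As an illustration, the statistic in (1) can be computed directly from the definitions above. The sketch below is ours; the tables used are hypothetical and do not reproduce the entries of Table 1:

```python
def chi2_statistic(o):
    """Chi-squared statistic (1) of a contingency table o, given as a list of rows."""
    r, c = len(o), len(o[0])
    n = sum(sum(row) for row in o)
    row_tot = [sum(row) for row in o]
    col_tot = [sum(o[i][j] for i in range(r)) for j in range(c)]
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            e = row_tot[i] * col_tot[j] / n   # expected frequency e_ij = o_i. * o_.j / n
            chi2 += (o[i][j] - e) ** 2 / e
    return chi2

print(chi2_statistic([[10, 10], [10, 10]]))   # perfectly independent table -> 0.0
print(chi2_statistic([[20, 10], [10, 20]]))   # = 100/15, about 6.67
```

The degrees of freedom, \((r-1)\times(c-1)\), depend only on the shape of the table, not on its entries.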

Table 2 Contingency table resulting from grouping together the categories \(v_1\) and \(v_3,\) yielding category \(v_1',\) in Table 1

However, it is known that the conclusions on independence depend, in general, on the granularity chosen for each of the categorical variables. For instance, let us consider that variable V in the example above is encoded as \(v'_1=\) \(v_1\) & \(v_3\) and \(v'_2=\) \(v_2\). Thus, observations in \(v_1\) and \(v_3\) are now grouped together, yielding the contingency table in Table 2, for which \(\chi ^2 = 3.519\). The clustered table thus has \(c'=2\) columns instead of the \(c=3\) in the initial table, and it is the one yielding the largest \(\chi ^2\) among all the tables of that granularity, namely with two rows and two columns. In this case, there is no significant evidence at a \(5\%\) significance level to reject the statistical independence assumption between the variables U and V (\(p\)-value\(=0.06\)). Therefore, Simpson’s paradox (Blyth 1972) arises in this example, since the less granular representation of the categorical variables in Table 2 supports a conclusion, namely statistical independence, different from that suggested by the variables before the grouping of categories, namely statistical dependence in Table 1 (Shmueli and Yahav 2017; Tsumoto 2009). Thus, we have identified a so-called extreme grouping: Table 1 has the largest granularity of the variables U and V for which statistical dependence between them is found.
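The two \(p\)-values above can be checked directly from the reported \(\chi^2\) values and their degrees of freedom: for 2 degrees of freedom the \(\chi^2\) survival function reduces to \(e^{-x/2}\), and for 1 degree of freedom to \(\mathrm{erfc}(\sqrt{x/2})\). A minimal sketch covering these two closed forms only (a general implementation would use the regularized incomplete gamma function):

```python
import math

def chi2_pvalue(x, df):
    """Chi-squared survival function; closed forms exist for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))   # X = Z^2 with Z standard normal
    if df == 2:
        return math.exp(-x / 2)              # chi-squared with 2 df is Exp(rate 1/2)
    raise NotImplementedError("general df needs the regularized incomplete gamma")

print(chi2_pvalue(6.355, 2))   # Table 1: (2-1)*(3-1) = 2 df, p ~ 0.04 < 0.05
print(chi2_pvalue(3.519, 1))   # Table 2: (2-1)*(2-1) = 1 df, p ~ 0.06 > 0.05
```

This reproduces the flip in the test's conclusion at the \(5\%\) level caused by grouping \(v_1\) and \(v_3\).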

In this paper we propose a mathematical optimization model to cluster the rows/columns of a contingency table so that the \(\chi ^2\) statistic is maximized for a fixed granularity of the variables. Solving this problem for different sizes allows us to either identify extreme groupings or conclude that they do not exist. Our approach is flexible enough to include constraints on the desirable structure of the clusters, such as must-link or cannot-link constraints on the categories that can, or cannot, be merged together, and ensure reasonable sample sizes in the cells of the clustered table from which trustful statistical conclusions can be derived. This constrained clustering approach allows us to incorporate background knowledge to support the analysis and extract meaningful conclusions (Abin 2019; Śmieja and Wiercioch 2017).

The problem of clustering the categories of a contingency table to find extreme groupings for a fixed granularity has not been studied as such in the literature. There are, however, related approaches that involve distributional assumptions or use ad-hoc heuristic procedures that are not flexible enough to include constraints on the clusters. Indeed, Greenacre (1988) proposed a greedy procedure, based on hierarchical clustering, which uses a \(\chi ^2\)-related distance function between row (resp. column) vectors. However, this approach does not guarantee that clustered tables with the highest dependence for a fixed granularity are found, because only a reduced family of allowable groupings, namely those with a hierarchical structure, is considered. The classical k-means clustering algorithm has also been adapted to the particular case of contingency tables (Govaert 1995; Govaert and Nadif 2007), and geometrical approaches, such as the maximum-tangent-plane method (Bock 2003), have been developed. Ciampi et al. (2005) propose using the coordinates obtained with correspondence analysis to find a clustering, and Álvarez de Toledo et al. (2018) use a similarity measure between the categories to obtain a partition. In order to find homogeneous clusters in document-term matrices, Ailem et al. (2016) propose maximizing a graph modularity criterion and Labiod and Nadif (2011) a community detection one. Whereas these approaches are based on the optimization of a measure of association, some probabilistic approaches have also been studied. In this case, it is assumed that each element of the contingency table is generated according to a probability model, which one tries to recover from the data. In this context, Ailem et al. (2017a, 2017b); Riverain and Nadif (2022) propose latent block models to identify a diagonal structure of homogeneous blocks in document-term matrices.
Proceeding this way, blocks of zeroes are identified and clustered together, thus yielding joint frequencies in the clustered table which are (close to) zero. A unified framework covering both the optimization of measures of association and probabilistic approaches is studied by Govaert and Nadif (2010). The aforementioned approaches are unable to deal with the analysis of dependence between variables in sparse tables, namely tables for which some of the observed joint frequencies are equal or close to zero and thus statistical conclusions cannot be inferred. It is well known that the common practice of adding constants to small joint frequencies can disturb the possible statistical dependence structure underlying a sparse table (Agresti and Yang 1987). If some categories were properly clustered together, this sparsity problem would be removed without damaging the possible underlying relationship between the variables. However, the existing methods cannot incorporate the corresponding constraints to ensure that this removal of sparsity is achieved.

The idea of clustering categories in contingency tables has also been applied to the discretization of continuous variables involved in supervised learning algorithms. To guide the search for the partition, a criterion which assesses the relationship between the intervals into which the continuous variable is split and the target values to be predicted is optimized. For instance, Kerber (1992) uses the \(\chi ^2\) distance between adjacent intervals to merge them if they are similar enough according to a given threshold. Boulle (2004) proposes a greedy approach and uses the \(p\)-value associated with the \(\chi ^2\) statistic of the clustered table to select the discretization. However, these methods fail when constraints have to be imposed on the discretization being sought, such as requiring each interval in the partition to contain a large enough number of observations, or rules that must be satisfied (e.g. a minimum or maximum length of the intervals which form the discretization).

In this paper, we propose assignment and set partitioning mathematical optimization formulations to cluster rows and/or columns of contingency tables maximizing the \(\chi ^2\) statistic in (1), as a measure of the strength of the dependence, for a fixed size of the clustered table. Solving this model for different sizes, we can decide whether the statistical dependence can be preserved with the chosen granularity of the variables. If this is the case, we reduce the size parameter and solve the maximization problem again. We do this until we find the extreme groupings, or conclude that they do not exist, namely that the dependence of the variables is preserved for any size of the reduced table. Our model can easily be enriched with constraints to incorporate user knowledge on the allowable groups of categories, or to successfully handle sparse tables. With the proposed formulations, even contingency tables such as the ones in the numerical section can be tackled using off-the-shelf optimization solvers.

The remainder of the paper is structured as follows. Section 2 states the mathematical optimization model to cluster categories in a contingency table, maximizing the \(\chi ^2\) statistic while imposing structural properties on the clusters. Assignment and set partitioning formulations for this model are presented in Sect. 3. Finally, Sect. 4 illustrates our methodology and Sect. 5 concludes the paper with some remarks and future research.

2 Problem definition

This section presents a mathematical optimization model to cluster the rows and/or columns of a given contingency table so as to maximize the \(\chi ^2\) statistic in (1) for a fixed granularity of the categorical variables, while also imposing requirements on the clusters, that is, conditions on the allowable groups of categories or thresholds on the sample sizes in the cells of the clustered table.

Let \(T_0\) be a contingency table representing the counts of outcomes of two categorical variables U and V, which both take a finite set of values (categories), \(u_1,\ldots ,u_r\) and \(v_1,\ldots ,v_c\), respectively. Recall that given a sample of n entities, \(o_{ij}\) denotes the joint observed frequency of the pair \((u_i,v_j)\), \(o_{i.}\) the marginal frequency of \(u_i\) and \(o_{.j}\) the marginal frequency of \(v_{j}\), for \(i=1,\ldots ,r\) and \(j=1,\ldots ,c\). In order to measure the strength of the association between the variables U and V, the \(\chi ^2\) statistic as stated in (1) is used. Let \(\chi ^2(T_0)\) be the value of (1) for data in table \(T_0\).

Given the contingency table \(T_0\), a clustered table T is obtained from it by merging the rows and/or columns of \(T_0\) into a new set of categories (clusters). In other words, a set of k (\(k\le c\)) clusters of columns of \(T_0\), \(\{{\tilde{v}}_1,\ldots ,{\tilde{v}}_k\}\), is a partition of the set \(\{v_1,\ldots ,v_c\}\) into k groups such that, for \(l,l'=1,\ldots ,k\):

  • \({\tilde{v}}_l\subseteq \{v_1,\ldots ,v_c\}\),

  • \(\displaystyle \bigcup \nolimits _{l=1}^k {\tilde{v}}_l = \{v_1,\ldots ,v_c\}\),

  • \({\tilde{v}}_l\cap {\tilde{v}}_{l'} = \emptyset \), \(l\ne l'\).

Similarly, row clusters can also be defined as \({\tilde{u}}_1,\ldots ,{\tilde{u}}_s\) (\(s\le r\)). The clustered contingency table T from \(T_0\) has a less granular representation of its categorical variables U and V, and its joint frequencies \({\tilde{o}}_{ml}\) are obtained as the sum of the corresponding joint frequencies in \(T_0\), namely \({\tilde{o}}_{ml} =\displaystyle \sum \nolimits _{\begin{array}{c} i:\, u_i\in {\tilde{u}}_m \\ j:\, v_j\in {\tilde{v}}_l \end{array}} o_{ij}\). In other words, a clustered table T accumulates the corresponding frequencies in \(T_0\). Let \(\chi ^2(T)\) be the value of (1) in table T.
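The construction of the clustered joint frequencies \({\tilde{o}}_{ml}\) from a pair of partitions amounts to summing blocks of \(T_0\). A minimal sketch of our own (the function name and the data are illustrative, not from the paper):

```python
def cluster_table(o, row_parts, col_parts):
    """Aggregate table o according to partitions of its row and column indices."""
    return [[sum(o[i][j] for i in rm for j in cl) for cl in col_parts]
            for rm in row_parts]

# Hypothetical 2x3 table; merge columns 0 and 2, keep the rows as they are:
o = [[20, 18, 10], [15, 25, 10]]
t = cluster_table(o, [[0], [1]], [[0, 2], [1]])
print(t)  # [[30, 18], [25, 25]]
```

Since the clustered table only accumulates frequencies, the grand total n is preserved, and the marginal frequencies of T are the aggregated marginals of \(T_0\).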

Clustering the rows and/or columns of a contingency table reduces the value of the \(\chi ^2\) statistic, that is, \(\chi ^2(T)\le \chi ^2(T_0)\) (see Govaert and Nadif (2018) for a detailed proof). Therefore, to see whether we can preserve the dependence structure between the variables U and V when their categories are clustered, we seek, for a fixed size, the clustered table T which maximizes \(\chi ^2(T)\). Repeating this procedure for different values of the granularity, we can either identify an extreme grouping or conclude that it does not exist, namely the two variables are dependent regardless of the size of the clustered table. Indeed, if statistical dependence is found in a clustered table T whereas the tables obtained by clustering T further exhibit independence, then we say that T is an extreme grouping.

In order for the new categories in table T,  namely the clusters, to be meaningful for the analyst, it is helpful to be able to incorporate prior knowledge about which groups of categories are allowed to be merged. In other words, not every possible combination of categories is allowed. Besides easing interpretability, clustering can be used to deal with sparsity issues in the entries of \(T_0\) by looking for aggregations of columns and/or rows which accumulate at least a certain number of observations. Let \({\mathcal {T}}(T_0)\) be the set of all possible contingency tables resulting from allowable groups of rows and columns of \(T_0\).

The problem of Clustering a Contingency Table (CCT) described above is stated as the following combinatorial optimization problem:

$$\begin{aligned} \displaystyle \max _{T}&\qquad \chi ^2(T) \\ \text{ s.t. }&\qquad T\in \mathcal{T}(T_0). \end{aligned}$$
(CCT)

(CCT) seeks the table \(T\in {\mathcal {T}}(T_0)\) that maximizes the strength of the association between the variables in the clustered table T measured through the \(\chi ^2\) statistic in (1), and satisfies the structure imposed on the clusters through the definition of the feasible set \({\mathcal {T}}(T_0)\). (CCT) is a combinatorial problem for which an assignment (0–1 nonlinear) formulation is proposed in the next section. Other formulations are also possible, such as a set partitioning one, which is stated in Sect. 3.3.

3 An assignment formulation and its set partitioning counterpart

This section is devoted to developing mathematical optimization formulations for the (CCT) model stated in Sect. 2. An assignment formulation is proposed in Sect. 3.1, in which the decisions to be made are whether or not a column of the observed contingency table \(T_0\) is assigned to a cluster of categories in the clustered table T, yielding a \(0-1\) nonlinear optimization model. Section 3.2 is devoted to formally modelling some structures which may be required of the clustered table T to, for instance, obtain meaningful clusters by incorporating expert knowledge on the allowable groupings, or to reduce sparsity. Recall that these conditions arise naturally from the problem under study. Finally, a set partitioning formulation is proposed in Sect. 3.3.

Clustering a contingency table \(T_0\) can be done either row-wise (only the rows are clustered while the initial columns in \(T_0\) are maintained), column-wise (only the columns are clustered while the initial rows in \(T_0\) are maintained), or in both directions, that is, columns and rows are both clustered into new categories. Whereas our approach is valid for any of these three options, the assignment formulation for (CCT) and its extensions are fully developed column-wise for the sake of clarity.

3.1 The assignment formulation for (CCT) with k clusters

Recall that \(\{v_1,\ldots ,v_c\}\) is the set of categories (columns) of variable V in the observed contingency table \(T_0\). The aim is to cluster the categories in \(T_0\) into k new categories, denoted \({\tilde{v}}_1,\ldots ,{\tilde{v}}_k\), \(k\le c\), in such a way that the categories in T form a partition of the ones in \(T_0\) and the \(\chi ^2\) statistic of the clustered table, that is \(\chi ^2(T)\), is maximized.

Let \(y_{jl}\) for all \(j\in \{1,\ldots ,c\}\) and \(l\in \{1,\ldots ,k\}\) be a binary decision variable defined as

$$\begin{aligned} y_{jl}=\left\{ \begin{array}{ll} 1 &{} \text{ if } \text{ the } j\text{-th } \text{ category } \text{ in } T_0 (v_j), \text{ is } \text{ assigned } \text{ to } \text{ the } l\text{-th } \text{ category } ({\tilde{v}}_l), \text{ in } T, \\ \\ 0 &{} \text{ otherwise }. \end{array}\right. \end{aligned}$$

Let \(\chi ^2\left( \{ y_{jl} \}_{\begin{array}{c} j\in \{1,\ldots ,c\} \\ l\in \{1,\ldots ,k\} \end{array}}\right) \) be the \(\chi ^2\) statistic for the clustered table T, which is defined by the y-variables and (1) as:

$$\begin{aligned} \chi ^2\left( \{ y_{jl} \}_{\begin{array}{c} j\in \{1,\ldots ,c\} \\ l\in \{1,\ldots ,k\} \end{array}}\right) = \displaystyle \sum _{l=1}^k\sum _{i=1}^{r} \frac{\left( \displaystyle \sum \nolimits _{j=1}^{c}(o_{ij}-e_{ij})y_{jl}\right) ^2}{\displaystyle \sum \nolimits _{j=1}^{c}e_{ij}y_{jl}} \end{aligned}$$

The problem of clustering the columns of a contingency table \(T_0\) into k clusters maximizing the \(\chi ^2\) statistic is stated as a 0–1 nonlinear optimization model, which consists of the maximization of a convex function subject to linear constraints, as follows:

$$\begin{aligned} \displaystyle \max&\qquad \chi ^2\left( \{ y_{jl} \}_{\begin{array}{c} j\in \{1,\ldots ,c\} \\ l\in \{1,\ldots ,k\} \end{array}}\right) \end{aligned}$$
(2)
$$\begin{aligned} \text{ s.t. }&\qquad \displaystyle \sum _{l=1}^k y_{jl} =1 ,\,\,\, j=1,\ldots ,c, \end{aligned}$$
(3)
$$\begin{aligned}&\qquad \displaystyle \sum _{j=1}^c y_{jl} \ge 1 ,\,\,\, l=1,\ldots ,k, \end{aligned}$$
(4)
$$\begin{aligned}&\qquad y_{jl} \in \{0,1\}\,\,\,\, j=1,\ldots ,c,\, l=1,\ldots , k. \end{aligned}$$
(5)

Constraint (3) ensures that each category in \(T_0\) goes to just one of the new categories (clusters) in T and constraint (4) imposes that each cluster has at least one category. Finally, constraint (5) defines the binary nature of y-variables. Note that problem (2)–(5) can be enriched with constraints to break the symmetry associated with the clusters.
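For small instances, the search defined by (2)–(5) can also be carried out by exhaustive enumeration of the y-variables, which is useful as a sanity check on a solver's output. The sketch below is our own illustration with a hypothetical table, not the 0–1 nonlinear formulation passed to a MINLP solver; it enumerates all assignments satisfying (3) and (5), discards those violating (4), and keeps the clustered table with the largest \(\chi^2\):

```python
from itertools import product

def chi2_statistic(o):
    """Chi-squared statistic (1) of a contingency table o (list of rows)."""
    r, c = len(o), len(o[0])
    n = sum(map(sum, o))
    rt = [sum(row) for row in o]
    ct = [sum(o[i][j] for i in range(r)) for j in range(c)]
    return sum((o[i][j] - rt[i] * ct[j] / n) ** 2 / (rt[i] * ct[j] / n)
               for i in range(r) for j in range(c))

def best_column_clustering(o, k):
    """Enumerate assignments of the c columns to k clusters and return the
    (chi2, assignment) pair maximizing the statistic of the clustered table."""
    r, c = len(o), len(o[0])
    best = (-1.0, None)
    for assign in product(range(k), repeat=c):   # (3)+(5): each column in one cluster
        if len(set(assign)) < k:                 # (4): no empty cluster allowed
            continue
        t = [[sum(o[i][j] for j in range(c) if assign[j] == l)
              for l in range(k)] for i in range(r)]
        best = max(best, (chi2_statistic(t), assign))
    return best

o = [[20, 18, 10], [15, 25, 10]]                 # hypothetical 2x3 table
best = best_column_clustering(o, 2)
print(best)
```

The enumeration grows as \(k^c\), which is exactly why the paper resorts to mathematical optimization formulations for realistic table sizes.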

3.2 Modelling some clustering structures

The assignment formulation in (2)–(5) provides a flexible framework to incorporate additional requirements on the clusters in table T in a straightforward manner, namely as additional linear constraints. We describe in what follows some of the most natural cases, although more complex structures, such as the constrained discretization of continuous variables for supervised learning algorithms, can also be modeled using the y-variables in the assignment formulation stated above.

  • Non-sparsity constraints: In contingency table analysis it often happens that some of the observed joint frequencies are equal or close to zero, and thus statistical conclusions cannot be inferred from the distribution of the \(\chi ^2\) statistic. In order to be able to apply statistical inference theory, sparsity problems in the observed table \(T_0\) might be mitigated by clustering some of its columns while imposing a threshold on the number of observations in each row of the new column (cluster). In other words, the user might require that each row of each column in the new table T contains at least \(\beta \) observations. A common value for \(\beta \) is 5. Such a condition can be added as a constraint to problem (2)–(5) as follows:

    $$\begin{aligned} \displaystyle \sum _{j=1}^{c} o_{ij}y_{jl} \ge \beta , \,\,\, i=1,\ldots ,r,\,l=1,\ldots , k. \end{aligned}$$
    (6)
  • Cannot-link constraints: A cannot-link constraint is used to specify that two or more specified categories in \(T_0\) cannot be associated with the same cluster in T. In its simplest case, namely two categories \(v_j\) and \(v_{j'}\) in \(T_0\) cannot be grouped together in T, the cannot-link constraint is modeled as

    $$\begin{aligned} y_{jl} + y_{j'l} \le 1 , \,\,\, l=1,\ldots , k. \end{aligned}$$
    (7)

    Condition (7) can be easily generalized to accommodate groups of categories in \(T_0\) which cannot belong to the same cluster.

    A complementary set of conditions to cannot-link ones are the so-called must-link constraints, which are used to specify that two or more categories in \(T_0\) must be assigned to the same cluster in T. Although this kind of condition could also easily be modeled in a similar fashion, it can be imposed in a preprocessing step.

  • Relational constraints: There might be structural conditions among categories which are more complex than the ones given by cannot or must-link constraints. That is the case of, for instance, the existence of a partial order relation \(\prec \) between the categories implying that, if two categories belong to one cluster then all the categories in-between must belong to the same cluster too. In its simplest case, namely two categories \(v_j\) and \(v_{j'}\) in \(T_0\) such that \(v_j\) precedes \(v_{j'}\) in the partial order, the so-called relational constraint is modeled as

    $$\begin{aligned} y_{jl} + y_{j'l} \le y_{j''l} + 1 , \,\,\, \text{ for } v_j\prec v_{j''} \prec v_{j'} \text{ and } l=1,\ldots , k. \end{aligned}$$
    (8)
  • Demand / capacity constraints: We may also require that each column l (cluster) in T contains at least \(a_l\) categories of \(T_0\) and/or no more than \(b_l\), thus establishing demand and/or capacity constraints, respectively, for \(l=1,\ldots ,k\). Such conditions can be added as constraints to problem (2)–(5) as follows:

    $$\begin{aligned} \displaystyle a_l\le \sum _{j=1}^{c} y_{jl} \le b_l, \,\,\, l=1,\ldots , k. \end{aligned}$$
    (9)
  • ‘et al.’ clustering: Given a contingency table \(T_0\) with c columns, the analyst might be interested in obtaining a clustered table T with k columns in which \(k-1\) of its categories are exactly \(k-1\) of the categories in \(T_0\) and the k-th category is made up of the aggregation of the remaining \(c-k+1\) categories in \(T_0\), that is, the ‘et al.’ category. This structure is a particular case of (CCT) with k clusters, in which \(k-1\) clusters are singletons and the k-th category in the new table comprises \(c-(k-1)\) categories. In order to obtain such a structure in T, constraint (4) in the formulation (2)–(5) for (CCT) must be replaced by

    $$\begin{aligned}&\displaystyle \sum _{j=1}^c y_{jl} = 1 ,\,\,\, l=1,\ldots ,k-1, \end{aligned}$$
    (4a)
    $$\begin{aligned}&\displaystyle \sum _{j=1}^c y_{jk} = c-k+1. \end{aligned}$$
    (4b)

    Nevertheless, the number of variables and constraints in the optimization problem defined by (2), (3), (4a), (4b) and (5) can be significantly reduced if the following variables are considered instead:

    $$\begin{aligned} y_{j}=\left\{ \begin{array}{ll} 1 &{} \text{ if } \text{ the } j\text{-th } \text{ category } \text{ in } T_0 (v_j) \text{ is } \text{ in } T \text{ as } \text{ a } \text{ singleton, } \\ \\ 0 &{} \text{ otherwise }. \end{array}\right. \end{aligned}$$

    Using this new definition of y-variables, the \(\chi ^2\) statistic in (1) is rewritten as

    $$\begin{aligned} \chi ^2(\{y_j\}_{j\in \{1,\ldots ,c\}}) = \displaystyle \sum _{i=1}^{r}\left\{ \sum _{j=1}^{c} \frac{(o_{ij}-e_{ij})^2}{e_{ij}}y_j + \displaystyle \frac{\left( \displaystyle \sum \nolimits _{j=1}^{c} (o_{ij}-e_{ij})(1-y_j) \right) ^2}{\displaystyle \sum \nolimits _{j=1}^{c}e_{ij}(1-y_j)}\right\} . \end{aligned}$$

    Therefore, the 0–1 nonlinear formulation for the (CCT) problem with the ‘et al.’ structure (2), (3), (4a), (4b) and (5) is rewritten as

    $$\begin{aligned} \displaystyle \max&\qquad \chi ^2(\{y_j\}_{j\in \{1,\ldots ,c\}}) \end{aligned}$$
    (10)
    $$\begin{aligned} \text{ s.t. }&\qquad \displaystyle \sum _{j=1}^c (1- y_{j}) = c-k+1 ,\,\,\, \end{aligned}$$
    (11)
    $$\begin{aligned}&\qquad y_{j} \in \{0,1\}\,\,\,\, j=1,\ldots ,c. \end{aligned}$$
    (12)

    Constraints (11) and (12) control the number of categories in table \(T_0\) which compose the ‘et al.’ category in table T and the binary nature of the y-variables, respectively.
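When (CCT) is solved by explicit enumeration over the y-variables on a small instance, the structures above simply become extra feasibility checks on each candidate assignment. Below, a non-sparsity threshold as in (6) and a single cannot-link pair as in (7) are illustrated; the table, the threshold value and the linked pair are arbitrary illustrative choices of ours:

```python
from itertools import product

def feasible(assign, o, k, beta, cannot_link):
    """Check a candidate column assignment against constraints (4), (6) and (7)."""
    r, c = len(o), len(o[0])
    if len(set(assign)) < k:                                    # (4): no empty cluster
        return False
    if any(assign[j] == assign[jp] for j, jp in cannot_link):   # (7): keep pair apart
        return False
    for l in range(k):                                          # (6): every cell of the
        for i in range(r):                                      # clustered table >= beta
            if sum(o[i][j] for j in range(c) if assign[j] == l) < beta:
                return False
    return True

o = [[20, 18, 10], [15, 25, 10]]                                # hypothetical table
sols = [a for a in product(range(2), repeat=3)
        if feasible(a, o, k=2, beta=15, cannot_link=[(0, 1)])]
print(sols)
```

In the formulations themselves these checks are linear constraints, so they are added to (2)–(5) without changing the nature of the model.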

3.3 A set partitioning formulation for (CCT)

In this section, an alternative formulation is proposed for (CCT), which assumes that there is a list of permissible aggregations of columns in \(T_0\) that can be used to build the clustered table T. Given such a list, a set partitioning formulation is proposed whose benefits with respect to an assignment one are twofold: first, its continuous relaxation is, in general, tighter (Freling et al. 2003), and second, the so-obtained formulation becomes 0–1 linear.

In order to state a set partitioning formulation for (CCT), let \(\left\{ S_1,\ldots ,S_K \right\} \) be a family of K subsets of the categories in \(T_0\) given by \(\{v_1,\ldots ,v_c\}\). These subsets represent the list of allowable aggregations of columns in \(T_0\), that is the list of clusters which can be used to build the columns in the clustered table T. Let A be a \(c\times K\) 0–1 matrix with entries \(a_{jp}\) for all \(j\in \{1,\ldots ,c\}\) and \(p\in \{1,\ldots ,K\}\) defined by

$$\begin{aligned} a_{jp}=\left\{ \begin{array}{ll} 1 &{} \text{ if } v_j\in S_p, \\ \\ 0 &{} \text{ otherwise }, \end{array}\right. \end{aligned}$$

and let \(x_{p}\) for all \(p\in \{1,\ldots ,K\}\) be a binary decision variable defined by

$$\begin{aligned} x_{p}=\left\{ \begin{array}{ll} 1 &{} \text{ if } S_p \text{ is } \text{ a } \text{ column } \text{ of } \text{ the } \text{ clustered } \text{ table } T, \\ \\ 0 &{} \text{ otherwise }. \end{array}\right. \end{aligned}$$

Let \(\chi ^2(\{x_p\}_{p\in \{1,\ldots ,K\}})\) be the \(\chi ^2\) statistic for the clustered table stated as

$$\begin{aligned} \chi ^2(\{x_p\}_{p\in \{1,\ldots ,K\}}) = \displaystyle \sum _{i=1}^{r} \sum _{p=1}^{K}\frac{\left( \displaystyle \sum \nolimits _{j=1}^c(o_{ij}-e_{ij})a_{jp}\right) ^2}{\displaystyle \sum \nolimits _{j=1}^c e_{ij} a_{jp}} x_{p}. \end{aligned}$$

Then, the set partitioning formulation for problem (CCT) is stated as:

$$\begin{aligned} \displaystyle \max&\qquad \chi ^2(\{x_p\}_{p\in \{1,\ldots ,K\}}) \end{aligned}$$
(13)
$$\begin{aligned} \text{ s.t. }&\qquad \displaystyle \sum _{p=1}^K a_{jp}x_p =1,\,\,\,\, j=1,\ldots ,c \end{aligned}$$
(14)
$$\begin{aligned}&\qquad x_p \in \{0,1\}\,\,\,\, p=1,\ldots , K. \end{aligned}$$
(15)

Whereas (14) ensures that each column of \(T_0\) belongs to just one cluster in T, constraint (15) imposes the binary nature of the x-variables. We point out that (13)–(15) is a 0–1 linear optimization problem, which can easily accommodate the structures discussed in Sect. 3.2. For instance, the relational condition requiring that if columns \(v_j\) and \(v_{j'}\) in \(T_0\) belong to the same cluster in table T then column \(v_{j''}\) also belongs to that cluster is handled through the list of allowable aggregations: the set \(\{v_j,v_{j'},v_{j''}\}\) is included among the S-sets, whereas any subset containing \(v_j\) and \(v_{j'}\) but not \(v_{j''}\) is excluded.
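Since the objective (13) is linear in the x-variables, the contribution of each allowable subset \(S_p\) can be precomputed once; for a tiny instance the best partition of a fixed size can then be found by enumerating the selections of subsets that satisfy (14). A sketch of ours, with hypothetical data and an arbitrary list of allowable subsets:

```python
from itertools import combinations

def subset_coeff(o, S):
    """Objective coefficient in (13) of a subset S of column indices of T_0."""
    r, c = len(o), len(o[0])
    n = sum(map(sum, o))
    rt = [sum(row) for row in o]
    ct = [sum(o[i][j] for i in range(r)) for j in range(c)]
    coef = 0.0
    for i in range(r):
        num = sum(o[i][j] - rt[i] * ct[j] / n for j in S)   # sum of observed - expected
        den = sum(rt[i] * ct[j] / n for j in S)             # aggregated expected frequency
        coef += num ** 2 / den
    return coef

o = [[20, 18, 10], [15, 25, 10]]                  # hypothetical 2x3 table
subsets = [(0,), (1,), (2,), (0, 2)]              # hypothetical allowable aggregations
coeffs = {S: subset_coeff(o, S) for S in subsets}

# Pick k = 2 subsets that exactly partition the columns (constraint (14))
# and maximize the linear objective (13):
best = max((sum(coeffs[S] for S in sel), sel)
           for sel in combinations(subsets, 2)
           if sorted(j for S in sel for j in S) == [0, 1, 2])
print(best)
```

In practice the selection step would be handled by an integer programming solver rather than enumeration; the point is that, once the coefficients are tabulated, the problem is 0–1 linear.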

Table 3 Main features of the contingency table from Kandoth et al. (2013) used to illustrate our approach

4 Illustrative examples

In order to illustrate the methodology proposed in this paper, a contingency table \(T_0\) obtained from a medical study by Kandoth et al. (2013) is considered. This table comprises information about \(n=9786\) biological samples, and its joint frequencies correspond to the number of cross-classified cases of the categorical variables cancer type (U), which has \(r=11\) categories, and significantly mutated gene (V), which has \(c=127\) categories. The genes are divided into 20 groups, which are defined according to biological features. This contingency table can be obtained from the supplementary material in Kandoth et al. (2013) and it is also included in Tables 8 and 9 in the Appendix. In Table 3 we show, for each of the groups \(g=1,\ldots , 20\): its description according to Kandoth et al. (2013), a color to represent it in the upcoming results, the number of genes (categories) in the group (\({\mathcal {S}}_g\)), and the percentage of cells of the contingency table within the group which are sparse (that is, for each \(i=1,\ldots ,11\) the cardinality of \(\{o_{ij}:\, o_{ij} < 5,\, j=1,\ldots ,{\mathcal {S}}_g\}\) divided by \({\mathcal {S}}_g\) times 100). We point out the noticeable number of joint frequencies which are below the usual threshold of 5, with the level of sparsity exceeding \(49\%\) within every group. For the interpretation of the references to color in Table 3, the reader is referred to the web version of this article.

Table 4 Assignment of genes in \(T_0\) to clusters in T solving model (2)–(6) (Part I)
Table 5 Assignment of genes in \(T_0\) to clusters in T solving model (2)–(6) (Part II)
Table 6 Assignment of genes in \(T_0\) to clusters in T solving model (2)–(7) (Part I)
Table 7 Assignment of genes in \(T_0\) to clusters in T solving model (2)–(7) (Part II)

In what follows, we present the results obtained from clustering the contingency table of Kandoth et al. (2013) using two different clustering structures. The optimization models involved in the experiments have been solved using Bonmin (Bonami and Lee 2017) under Pyomo. Bonmin is an open-source numerical optimization procedure for solving general Mixed Integer Nonlinear Programs by means of Branch-and-Bound and Branch-and-Cut algorithms, thus avoiding an explicit complete enumeration of all the feasible solutions of the models stated in Sect. 3. To avoid getting stuck at local optima, each problem has been solved 10 times, with a time limit of one hour, on a PC with an Intel Core\(^{\hbox {TM}}\) i7-7700 processor and 16GB of RAM.

As mentioned above, the contingency table \(T_0\) in Kandoth et al. (2013) is highly sparse. It is well known that in such cases the asymptotic distribution of the \(\chi ^2\) statistic fails, and thus statistical conclusions about the dependence between U and V cannot be inferred (Agresti and Yang 1987). To overcome this limitation, a less granular representation of the genes can be considered, so that the columns of table \(T_0\) (genes) are clustered into broader categories (groups of genes) in such a way that the aggregated joint frequencies exceed a threshold, and thus an eventual statistical dependence between U and V in the original table \(T_0\) may be revealed. To do so, the optimization model defined by (2)–(6) is solved for \(\beta =5\). Tables 4 and 5 contain the assignment of the genes in \(T_0\) to the clusters (new categories made up of groups of genes) in T for the number of clusters k varying from 2 to 10. Thus, an initial table \(T_0\) with \(r=11\) and \(c=127\) is reduced to tables T with \(r=11\) and \(c=k\), for \(k=2,\ldots ,10\). The first column of Tables 4 and 5 contains the values of the \(\chi ^2\) statistic in T (and the associated \(p\)-value in parentheses). We point out that statistical dependence between U and V is detected when the granularity of V is fixed to \(k=2,\ldots ,10\) and a significance level of \(\alpha =5\%\) is considered. Thus, we can conclude that, under the aforementioned conditions, there does not exist an extreme grouping, since the null hypothesis of statistical independence is rejected for all \(k\ge 2.\)
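The evaluation step reported in the first column of Tables 4 and 5 can be sketched as follows: given a cluster label for each column of \(T_0\) (in practice the solution returned by the solver; in the toy assertion below an arbitrary labeling), the columns sharing a label are summed into the clustered table \(T\), on which the Pearson \(\chi ^2\) statistic is evaluated.

```python
import numpy as np

def cluster_columns(T0, labels, k):
    """Aggregate the columns of T0 that share a cluster label into table T."""
    T = np.zeros((T0.shape[0], k))
    for j, g in enumerate(labels):
        T[:, g] += T0[:, j]
    return T

def chi2_statistic(T):
    """Pearson statistic: sum over cells of (o_ij - e_ij)^2 / e_ij, where
    e_ij = o_i. * o_.j / n is the expected frequency under independence."""
    n = T.sum()
    expected = np.outer(T.sum(axis=1), T.sum(axis=0)) / n
    return float(((T - expected) ** 2 / expected).sum())
```

The \(p\)-values in parentheses in Tables 4 and 5 then follow from the \(\chi ^2\) distribution with \((r-1)(k-1)\) degrees of freedom (e.g. via `scipy.stats.chi2.sf`).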

The genes in \(T_0\) are split into 20 groups defined through biological features. A plausible requirement could be that the genes of some groups in \(T_0\) must belong to the same clusters in T, to avoid having genes of the same group spread out across different clusters as in Tables 4 and 5. These must-link conditions can be imposed in a preprocessing step. In this case, we impose that the genes in Groups 1, 2, 4 and 6, respectively, must belong to the same cluster. In addition, our preprocessing incorporates the requirement that Groups 4 and 6 belong to the same cluster. Conversely, some groups might be required to belong to different clusters. This structure is illustrated by imposing constraint (7) for \(j\in \) Group 1 and \(j'\in \) Group 2. Tables 6 and 7 contain the assignment of the genes in \(T_0\) to the clusters (new categories) in T for the number of clusters k varying from 2 to 10. As before, the first column of these tables contains the values of the \(\chi ^2\) statistic in the clustered table T (and the associated \(p\)-value in parentheses). In this case, statistical dependence between U and V is also detected when the granularity of V is fixed to \(k=2,\ldots ,10,\) a significance level of \(\alpha =5\%\) is considered, and the group structures are incorporated into the clustering process. Thus, we can conclude that, under the aforementioned conditions, there does not exist an extreme grouping, since the null hypothesis of statistical independence is rejected for all \(k\ge 2.\)
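The must-link preprocessing admits a simple implementation: summing the columns of each must-link group into a single super-column before the model is solved guarantees that all genes of the group end up in the same cluster. The sketch below uses illustrative column indices, not the actual gene indices of \(T_0\); the cannot-link condition, in contrast, stays in the model as the linear constraint (7).

```python
import numpy as np

def merge_must_link(T0, groups):
    """Replace each must-link group of column indices by its summed column.

    `groups` is a list of lists of column indices; columns not mentioned in
    any group are kept as they are. The returned matrix is the input of the
    clustering model, with one column per super-column or free column.
    """
    in_group = {j for g in groups for j in g}
    cols = [T0[:, g].sum(axis=1) for g in groups]
    cols += [T0[:, j] for j in range(T0.shape[1]) if j not in in_group]
    return np.column_stack(cols)
```

For instance, merging the (illustrative) columns 0 and 1 of a table with four columns produces a table with three columns, the first of which carries the aggregated frequencies of the merged pair.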

The contingency tables obtained from the clusterings shown in Tables 4, 5, 6 and 7 are depicted in the Supplementary Material. We refer the reader to the web version of this article for the interpretation of the references to color in Tables 4–7.

5 Conclusions

In this paper we have addressed the problem of clustering categories in contingency tables by maximizing the \(\chi ^2\) statistic (Mirkin 2001; Pearson 1900). Solving this clustering problem for different sizes, namely different granularities of the categorical variables under study, allows us to identify extreme groupings or, in other words, the ways categories can be clustered into larger ones so that the dependence of the variables is no longer detected once the granularity of the variables is reduced. To do so, a combinatorial mathematical optimization model has been stated, which makes it possible to accommodate structural properties of the clusters in the clustered table that naturally arise in the context of the dataset under study. An assignment formulation, namely (CCT), has been proposed for this model, yielding a 0–1 nonlinear optimization problem. Requirements on the clusters, such as non-sparsity conditions and relational and cannot-link constraints, have been stated as linear constraints. In addition, a set partitioning reformulation of (CCT) has also been proposed. Our methodology is illustrated on a dataset from a medical study, which naturally demands the tools proposed in this paper to study the statistical dependence between its variables under structural conditions on the clusters.

The problem studied in this paper can be extended in several directions. First, criteria other than the \(\chi ^2\) statistic could be considered to measure statistical dependence (Goodman and Kruskal 1979; Joe 1989). Second, criteria for grouping the categories in contingency tables other than statistical dependence, defined through appropriate combinatorial optimization models, could also be explored as extensions of this paper. Interesting examples would be to exploit patterns in the observed joint frequencies to group the categories of a contingency table, or to identify those patterns in the coordinates given by Correspondence Analysis (Ciampi et al. 2005; Pledger and Arnold 2014; van de Velden et al. 2020). Third, tighter formulations for (CCT) could be explored, in combination with metaheuristic approaches such as Variable Neighborhood Search (Mladenović and Hansen 1997) or Large Neighborhood Search (Pisinger and Ropke 2010), to address larger tables. Finally, extensions of the proposed methodology to multi-way tables require further research (Agresti and Gottard 2007).