E-ReMI: Extended Maximal Interaction Two-mode Clustering

In this paper, we present E-ReMI, a new method for studying two-way interaction in row by column (i.e., two-mode) data. E-ReMI is based on a probabilistic two-mode clustering model that yields a two-mode partition of the data with maximal interaction between row and column clusters. The proposed model extends REMAXINT by allowing for unequal cluster sizes for the row clusters, thus introducing more ﬂexibility in the model. In the manuscript, we use a conditional classiﬁcation likelihood approach to derive the maximum likelihood estimates of the model parameters. We further introduce a test statistic for testing the null hypothesis of no interaction, discuss its properties and propose an algorithm to obtain its distribution under this null hypothesis. Free software to apply the methods described in this paper is developed in the R language. We assess the performance of the new method and compare it with competing methodologies through a simulation study. Finally, we present an application of the methodology using data from a study of person by situation interaction.


Introduction
This paper addresses the analysis of two-way two-mode data (Carroll & Arabie, 1980) that can be arranged in an I × J real-valued data matrix D, with elements d i j (i = 1, . . . , J , j = 1, . . . , J ), and in which the rows pertain to one of the two modes and the columns to the other. Specifically, we consider the case in which both modes are considered as categorical predictor variables and d i j is a real-valued outcome value for the combination of row i and column j. Such data abound in many research settings. An example is that of contextualized personality research, where a set of I persons is measured on some behavior of interest in J different situations. Other examples include the study of micro-array data in genome research clusters. Hence, if there are individual row (column) main effects, the row (column) partitions may tend to capture those effects so as to explain the associations between observations of any two columns (resp. between observations of any two rows) that are implied by those effects (see Section 5 for an illustration). In short, a block mixture model does not yield row and column partitions that maximize interaction but rather the sum of row and column main effects plus row by column interaction. In contrast, REMAXINT (Ahmed et al., 2021) assumes that the random variables d i j are conditionally independent, not only given R and C, but also the individual row and column main effects. Therefore, this method yields a twomode partition of D in which the row and column clusters explain the associations between observations of any two columns (resp. between observations of any two rows) over and above the associations that are implied by row/column main effects. REMAXINT extends maximal interaction two-mode clustering ( (Schepers et al., 2017), MAXINT), by allowing the rows to be random instead of fixed.
Other biclustering approaches that are based on interaction concepts include the methods proposed in Cheng & Church (2000) and Cho et al. (2004), which are widely applied to gene expression data. These methods look for biclusters with minimal within-bicluster interaction, that is, minimal mean sum-squared residue (Yu et al., 2021). Specifically, row and column main effects are bicluster-specific and within-bicluster row by column interaction is negligible. This implies that the total row by column interaction sum of squares is represented by the between-bicluster differences with regard to the bicluster means plus the between-bicluster differences with regard to the bicluster-specific main effects. In contrast, in MAXINT (and REMAXINT), row (resp. column) main effects are not specific to column (resp. row) clusters. The total row by column interaction sum of squares is therefore represented more easily by between-bicluster differences with regard to the bicluster means (i.e., biclusters with large between-bicluster interaction). A thorough discussion comparing biclustering methods based on interaction concepts can be found in Schepers et al. (2017).
In this paper, we focus on the objective of finding biclusters with large between-bicluster interaction. Specifically, we propose E-ReMI (Extended-ReMaxInt), a new two-mode clustering method that relaxes the unlikely assumption of equal population cluster sizes that is (implicitly) made in REMAXINT. E-ReMI is a model-based clustering method (Bock, 1996;Banfield & Raftery, 1993) that assumes interaction between row and column clusters in the sense that all row by column interactions are identical within each of the P Q pairs of row and column clusters (i.e., within each bicluster). In other words, it is assumed that any substantial row by column interaction in the data matrix (D) is attributable to a row by column cluster interaction. Our proposed method yields simultaneous partitions of the rows and columns of D such that a conditional likelihood criterion is maximized. The row and column clusters are not determined by differences between row and column main effects, respectively, but only by row by column interaction effects. Furthermore, row main effects are considered random, and row cluster sizes are allowed to vary between clusters and are unknown parameters to be estimated. Additionally, we introduce a pivotal likelihood ratio test, based on E-ReMI and Monte Carlo sampling, to test the null hypothesis of no interaction between the row and column clusters. Finally, in order to make the methodology of this paper available to a large public, we implement this method in the free software R. This implementation not only includes the newly proposed method but also codes for REMAXINT, which was previously only available in Matlab.
The remainder of this article is organized as follows: In Section 2 we formulate the statistical model and consider parameter estimation using a conditional likelihood approach. In Section 3, we propose a method for statistical inference on the interaction effect parameters that is based on a Monte Carlo scheme. Section 4 investigates, by means of a simulation study, the impact of relaxing the assumption of equal row cluster sizes on statistical power and parameter recovery. This study also includes a comparison, in terms of power, to some of the tests reviewed in Shenaravi & Kharrati-Kopaei (2018). Section 5 studies this impact on a real data set from a study in personality psychology. Finally, Section 6 is dedicated to a discussion and some final remarks.

Model Formulation
We consider a two-mode partitioning of an I × J two-mode real-valued data matrix D with elements d i j (i = 1, . . . , I , j =, 1, . . . , J ) to capture the gist of row by column interaction that is included in these data. We assume that the elements of the row set R = {1, . . . , I } are sampled independently from a population that includes P subpopulations with relative sizes ω p (0 ≤ ω p ≤ 1; p = 1, . . . , P and P p=1 ω p = 1). This results in a latent random Ppartition R = {R 1 , ..., R P } of R that is characterized by the random P-dimensional cluster indicator vectors Z 1 , ..., Z I , where, for every i, Z i = (Z i1 , . . . , Z i P ) T , with Z i p = 1 for i ∈ R p and Z i p = 0 for i / ∈ R p . This implies that Z i is distributed according to a Multinomial distribution that consists of one draw on P categories with probabilities ω 1 , . . . , ω P (i.e., Z i ∼ Mult P (1, ω), where ω = (ω 1 , . . . , ω P ) T ).
Furthermore, we assume that there is a latent fixed Q-partition C = {C 1 , ..., C Q } of the column set C = {1, . . . , J } that is characterized by the binary indicator matrix K, such that an element k jq = 1 for j ∈ C q and k jq = 0 for j / ∈ C q ( j = 1, . . . , J , q = 1, . . . , Q), with |C q | = J j=1 k jq . The row and column clusters are characterized by two features: 1. The clusters must be jointly exhaustive, i.e. p R p = R and q C q = C, for rows and columns, respectively; 2. The clusters must be mutually exclusive, i.e. R p R p = φ (∀ p = p ), and C q C q = φ (∀q = q ), for rows and columns, respectively.
A bicluster R p × C q is the Cartesian product of row cluster R p and column cluster C q and we denote a two-mode partition as R × C = {R p × C q ; p = 1, . . . , P, q = 1, . . . , Q}.
Let D 1 , . . . , D I denote a random sample of size I , where D i is a J -dimensional vector with probability density function f (d i ) on R J . We use D = ( D 1 , . . . , D I ) T to represent the entire sample. We denote by d = (d 1 , . . . , d I ) T an observed random sample, where d i = (d i1 , . . . , d i J ) T is the realisation of the random vector D i . Likewise, z i = (z i1 , . . . , z i P ) T is the realisation of the random vector Z i , with |R p | = I i=1 z i p . Given a two-mode partition R × C of D, each of its elements D i j , is assumed to be described by the following linear framework: for i ∈ R p , j ∈ C q , p = 1, . . . , P, q = 1, . . . , Q, and where μ is the overall mean, α i is a random main effect of row i, β j is a fixed main effect of column j and γ pq is a fixed interaction effect associated to bicluster R p ×C q . We assume known numbers of row and column clusters P and Q, respectively (we comment on this last assumption in the discussion). We assume that the row main effects α i are random (with E(α i ) = 0 and σ 2 α i > 0, ∀i) and that the column main effects β j are fixed (with J j=1 β j = 0). Furthermore, we impose identifiability constraints P p=1 ω p γ pq = 0 (q = 1, . . . , Q) and Q q=1 |C q |γ pq = 0 (p = 1, . . . , P) on the interaction effect parameters. Lastly, i j represents the random error term. The error terms are assumed to be i.i.d. across rows and columns, with mean zero and variance σ 2 . We assume that the residuals i j in (1) are Normally distributed. Then, conditionally on i ∈ R p and j ∈ C q , where the distribution function F(α i ) of the random effects α i does not need to be specified (see below). The random-partition model E-ReMI considers the data d 1 , . . . , d I as incomplete since the associated row cluster labels z 1 , . . . , z I are unobserved (i.e., missing). Note that the model assumes that, conditionally on z i and α i , the univariate random variables D i j ( j = 1, . . . , J ) are statistically independent (i.e., local stochastic independence). Let g i j (d i j |α i , μ, β j , γ pq ) denote the density function of D i j , conditionally on i ∈ R p and j ∈ C q . We then have is the vector of unknown parameters, and where . . , γ pQ ).

Conditional Classification Likelihood
In this section we will introduce some notation to shorten and simplify the interpretation of the equations. For the convenience of the reader the introduced notation is listed in Table 1.
For estimation of the parameters of interest, E-ReMI makes use of a combination of two strategies, namely, conditional likelihood and classification likelihood. We first discuss the conditional likelihood approach.
Conditional likelihood is a well-known approach in modern psychometrics (Andersen, 1973;Fischer & Molenaar, 1995) and biostatistics (Anderson & Senthilselvan, 1980) that involves the elimination of so-called nuisance parameters from the likelihood function. The nuisance parameters may pertain to fixed or random effects ( (Verbeke & Molenberghs, 2000), sec.13.5). If the nuisance parameters pertain to random effects, the main advantage of this approach is that no distributional assumption is needed with respect to those random effects (Verbeke et al., 2001). Here, we treat the random effects α i in (2) as nuisance parameters and, rather than maximizing the joint likelihood of all random variables, estimation of the parameters of interest is achieved by maximizing the conditional likelihood given the sufficient statistics (d i· 's) for these nuisance parameters (α i 's). We only assume that the latter are i.i.d. with finite variance and zero expectation (the latter for identifiability of μ in the model). This leads us to formulate the following theorem.

Theorem For the I independent joint realizations
showing that the joint density of the data matrix d and the row partition z can be factored such that it satisfies Fisher's factorization theorem (Fisher, 1922;Neyman, 1935;Rice, 2007), where α = (α 1 , . . . , α I ) is the vector of the random row main effects and z = (z 1 , . . . , z I ) T is the realized vector of the latent random row cluster membership indicator vectors. For our purposes, we rewrite (3) as: Note that W and V do not depend on the data nor on the main and interaction effect parameters. We then add and subtractd i· = 1 J J j=1 d i j into the last term of our expression and expand the square, obtaining where are the square terms, and U is a sum of cross product terms that can be written as We show how to get this result in Appendix A. Under the set of identifiability constraints Q q=1 |C q |(−γ pq ) = 0 ( p = 1, . . . , P), it follows that U = 0. Therefore, (4) reduces to the simplified The joint density in (6) satisfies Fisher's factorization theorem since W and V do not depend on the observed data, A is a function of μ+α i and depends on the observed data only through the statisticsd i· , while B is not a function of μ + α i .
Estimation of the vector of unknown parameters ξ = (φ, θ , K ), conditional on the sufficient statisticsd i· , implies maximizing the conditional likelihood obtained from (6) by leaving out the first factor. Specifically, is maximized with respect to ξ and the unobserved row cluster labels z 1 , . . . , z I . Since the latter are treated as parameters to be estimated along with ξ , this is referred to as a classification likelihood approach (Scott & Symons, 1971;Symons, 1981;Bock, 1996;Govaert & Nadif, 2013). No restrictions are put on z (resp. K) other than that P p=1 After applying a logarithmic transformation to (7), the equation for the conditional classification log-likelihood becomes: The unknown parameter σ 2 is involved in the kernel of this equation (and in V ) and, thus, has to be accounted for in the maximization. We replace it with its maximizer, obtained by partial differentiation of the equation above. This yields the following result After plugging this expression in the equation above, and after some simplifications, we obtain the following criterion to maximize (see Appendix B): where H = I J 2 log I J 2π − 1 is an additive constant. This criterion is fully written as where ω p ( p = 1, . . . , P), β j ( j = 1, . . . , J ), γ pq ( p = 1, . . . , P, q = 1, . . . , Q), z i p (i = 1, . . . , I , p = 1, . . . , P) and k jq ( j = 1, . . . , J , q = 1, . . . , Q) are to be estimated, subject to the constraints given in this section. This is a penalized classification likelihood criterion (Bryant, 1991;Celeux & Govaert, 1992), where the first term is a penalty term that penalizes row partitions that imply higher levels of unpredictability (Shannon, 1948), that is, solutions with more balanced cluster sizes.

Model fitting
Maximizing (9) with respect to the unknown parameters z, ω, β, γ , K is a mixed continuous-combinatorial optimization problem because B and W involve the unknown cluster membership indicators z i p and k jq on top of the unknown row cluster sizes ω p and the fixed effects β j and γ pq . For this optimization problem, there are currently no routines available in statistical software. We therefore developed an iterative procedure that is presented in this subsection. To explain this iterative procedure, we will first discuss parameter estimation of ω p , β j , and γ pq for a given two-mode partition R×C, that is, for given matrices z and K. Subsequently, we discuss how to maximize the criterion given in (9) with respect to the two-mode partition R × C.

Estimation of two-mode partition
In the previous subsection, we discussed parameter estimation of ω p , β j and γ pq , given an arbitrary two-mode partition R × C. The challenge of finding the best fitting E-ReMI model for a data set at hand is completed by addressing the estimation of z and K.
Based on the estimates of ω p , β j and γ pq , finding the best fitting configuration (i.e, the configuration that maximizes (ξ )), given P and Q, comes down to maximizing the following clustering criterion 1 : with respect to z and K. This is a combinatorial optimization problem, for which it is not feasible to apply a procedure that guarantees finding the global maximizer. Instead, we use a greedy optimization algorithm that starts from some initial configuration (z 0 , K 0 ) and deterministically searches through the solution space as long as neighbouring configurations with a better likelihood value can be found. In order to explain the details of this estimation algorithm, it is useful to rewrite the optimization problem as follows: . is a transformation (double centering, see Table 1) of the observed data that can be computed once, before the start of the algorithm. The algorithm generates increasing values of the (log)likelihood by iterating between obtaining updated estimates of the row and column cluster membership indicators and the interaction effect parameters, respectively. It is described in more detail in Algorithm 1. In order to increase the probability of finding the global maximum, one may run the algorithm M times using independently generated random initial configurations (z 0 , K 0 ) and retain the configuration that yields the highest value of (10). Note that the assumption of equal row cluster sizes can be imposed by settingω p = 1 P . This implies that the term P p=1 I i=1 z i p log( I i=1 z i p I ) in (10) reduces to the constant −I log(P). Let (CC) * denote this constrained criterion. Maximizing (CC) * is equivalent to fitting a REMAXINT model (Ahmed et al., 2021) to the data at hand.

Hypothesis testing
In this section we are concerned with testing the null hypothesis that all interaction effect parameters in (1) are equal to zero. Formally, H 0 : γ pq = 0 (for p = 1, . . . , P, q = 1, . . . , Q) for a fixed number of row and column clusters P and Q, respectively. We propose a likelihood ratio test for testing this null hypothesis. This implies obtaining the conditional likelihood of the data under the null hypothesis. In the next subsection, we first obtain this expression and then we propose a new test statistic based on the ratio of two conditional likelihoods, that is, the conditional likelihoods under the alternative and null hypotheses.

Test statistic
Under the null hypothesis, all interaction effect parameters in (1) are equal to zero (i.e., γ pq = 0, p = 1, . . . , P, q = 1, . . . , Q). This implies that (1) reduces to for i = 1, . . . , I and j = 1, . . . , J . Note that (11) implies that there is no partitioning in the data generating mechanism assumed by the reduced model. In order to obtain the conditional likelihood expression for this model, we start by denoting the reduced parameter Algorithm 1 Algorithm to find E-ReMI local maximum.
Input: d i j (i = 1, . . . , I , j = 1, . . . , J ), P, Q Output: z, K (Estimated row and column cluster memberships) Randomly generate a starting configuration (z n , K n ) Computeγ n pq for that configuration CC (0) ← the value of (10) for this initial configuration ζ ← Any arbitrary value larger than 0 while ζ > 0 do Keep K n fixed and find update z n+1 as follows: for (i = 1, . . . , I ) do if i not in a singleton cluster then for ( p = 1, . . . , P) do define candidate z by setting z i p ← 1 and z i p * ← 0( p * = p) updateγ pq and evaluate (10) end for choose candidate z that maximizes (10) as z n+1 end if end for Keep z n+1 fixed and find update K n+1 as follows: if j not in a singleton cluster then for (q = 1, . . . , Q) do define candidate K by setting k jq ← 1 and k jq * ← 0(q * = q) end for end if choose candidate K that maximizes (10) as K n+1 end for computeγ n+1 pq based on z n+1 and K n+1 compute value of (10) for z n+1 , K n+1 andγ n+1 pq CC (n+1) ← the value of (10) for the current configuration; ζ ← CC (n+1) − CC (n) ; n ← n + 1 end while return (z, K) end function vector as ξ 0 = φ = (μ, β 1 , . . . , β J , σ 2 ), that does not include any parameters that pertain to a bipartition R × C (since there are no clusters). The density (conditional on α) of the entire sample, can be written as: We next factorize this joint density by applying the same steps as in Section 2.2 and finally obtain the conditional log-likelihood for the model under the null hypothesis: where V and B are defined as in Section 2.2 but with γ pq = 0 ( p = 1, . . . , P, q = 1, . . . , Q). Replacing σ 2 by its maximizer yields, after some simplifications, the following criterion to maximize w.r.t. the β j 's It can be shown that the conditional m.l. estimate of β j isβ j =d . j −d .. . We propose the following test statistic to test the null hypothesis of no interaction: where (ξ ) and (ξ 0 ) are the logarithms of the maximized conditional likelihood functions for the alternative and null model, respectively. After some simplifications, this yields the following form of the test statistic: where in practice z ik and k jq are replaced by their estimated valuesẑ ik andk jq as returned by Algorithm 1. Note that the difference between the logarithms in the parentheses (second line) is a difference between log-squared-residuals that is necessarily non-negative and is expected to grow if I and/or J is increased. The penalty term P p=1 I i=1 z i p logω p = P p=1 |R p | logω p is necessarily negative (for P ≥ 2), does not depend on J , and becomes smaller (i.e., larger in absolute value) if I increases.
It is worth noting that if one imposes an assumption of equal row cluster sizes by settinĝ ω p = 1 P , this penalty term becomes a constant E = −I log(P). Let this constrained version of the test statistic be denoted as λ * L R : Note that, for a given data set, the numerator of the term in the logarithm is a constant. Therefore, (13) is maximized by minimizing the denominator of the term in the logarithm. Furthermore, minimizing that term is equivalent to maximizing This implies that the maximizer of (13) is the maximizer of which is the test statistic used to test for interaction in REMAXINT (Ahmed et al., 2021). We therefore refer to a test based on (13) as REMAXINT test.
It is important to emphasize that z and K are not known, but must be estimated from the data by maximizing (9). As a result, the sampling distribution of λ L R is not known and must be obtained by Monte Carlo simulations, just like related statistics defined on clustering approaches (Ahmed et al., 2021;Bock, 1996). We will elaborate on the Monte Carlo procedure in the next subsection, but first it is useful to discuss some properties of λ L R .
Property 1 The distribution of λ L R under H 0 and H 1 does not depend on the value of the unknown residual variance σ 2 .
Proof Consider the following transformation of the data: d i j = m · d i j , that is, multiplication by a constant factor m. This transformation implies the residual variance of the transformed data to be σ 2 = m 2 · σ 2 . This transformation further implies (dc) i j = m · (dc) i j and γ pq = m ·γ pq . Since all terms within the squares of λ L R are multiplied by this factor m, the arguments of the logarithms in the second and third term of (12) are multiplied by m 2 and cancel out of the equation.

Property 2 The distribution of λ L R under H 0 and H 1 does not depend on the values of the unknown parameters μ, α
Proof This follows from (dc) i j being only a function of γ pq and i j , which has been shown in Ahmed et al. (2021).

Computational Procedure
In order to test the null hypothesis, it is possible to draw from the true null distribution of λ L R rather than using a bootstrap approach (e.g. (McLachlan & Peel, 1997;Hennig & Lin, 2015)). Specifically, in order to obtain the sampling distribution of the test statistic λ L R under the null hypothesis of no interaction, we propose a three steps Monte Carlo scheme (for fixed numbers of row and column clusters P and Q, respectively), , which is the model under the null hypothesis of no interaction. We discuss briefly how to set these parameters below.

Simulation Study
In this section, we report an evaluation of the proposed methodology in terms of several criteria. In the following subsections, we first present the design of the simulation studies and then discuss the results, focusing on Type-I error rate and power of the likelihood ratio test of interaction performed through λ L R . Subsequently, we focus on evaluating E-ReMI estimates in terms of parameter recovery.

Design
In this subsection we discuss the design of three simulation studies. The first study is used to establish critical values of the test statistic λ L R as a function of two completely crossed experimental factors: size (I × J ) of the data set and number of clusters (P * , Q * ) as assumed in the data analysis. The critical values are found following the three steps in the computational procedure described in Section 3.2, with L = 5000 and M = 20 number of random starts. The second simulation study investigates to what extent these critical values are subject to sampling errors, since the sampling distribution is generated based on a finite L. The third simulation study is to assess the power of the test statistic λ L R to detect row by column interaction. In the first two simulation studies the design factors size of the data and number of clusters were varied across a range of values: In the third simulation study, there are two additional design factors, which are, the true number of clusters used for data generation (P, Q) and equality of expected row cluster sizes. In this simulation study the design factors are not fully crossed.

Critical Values
In order to determine critical values of the proposed test statistic λ L R we applied the three steps Monte Carlo scheme described in Section 3.2. Specifically, we generated L = 5000 independent data sets without any row by column interaction for each level of the design factor size (I × J ) of the data. That is, data were generated from the null model such that D , σ 2(sim) ) . Without loss of generality, we set σ 2(sim) = 1 and, On each generated data set, we applied E-ReMI for each level of number of clusters assumed in the analysis. For any combination of size and number of clusters this yields L = 5000 simulated Monte Carlo test statistic values λ L R = {λ (l) L R ; l = 1, . . . , L}, whose distribution approaches the sampling distribution of λ L R under the null hypothesis, if L is sufficiently large (Efron, 1982;Chernick, 2011). From this simulated empirical distribution of λ L R we then obtain critical values λ (α) L R for any significance level α by finding its 100(1−α)th quantile.

Type-I Error Rate
In a second simulation study we investigate to which extent sampling error affects establishing critical values if L = 5000. Specifically, for each level of size of the data, we generated a new set of 5000 independent data sets from the null model such that where μ ∼ U(0, 1), α i ∼ N (0, 1) and β j ∼ N (0, 1) rather than set all of them to 0 as in the first simulation study. Furthermore, given Property 1 of the test statistic λ L R (see Section 3.1), we set the error variance arbitrarily at σ 2 = 7.
Each data set was then analyzed by applying E-ReMI for each level of number of clusters. Every single analysis yields an observed value of the test statistic λ (obs) L R , which may be compared to the critical value for that combination of size and number of clusters as was obtained in the first simulation study in Section 4.
L R the decision is to reject the null hypothesis (i.e., no row by column interaction) in favour of the alternative hypothesis that there is some interaction. For each combination of size and number of clusters, the proportion of λ (obs) L R values (out of all 5000 data sets for that combination of size and number of clusters) that fall in the rejection region corresponds to the empirical Type-I error rate of the E-ReMI interaction test. If this study yields empirical Type-I error rates close to the nominal level α = 0.05, we may conclude that choosing L = 5000 is sufficient for accurately establishing critical values.
To generate a data set of size I × J , with true number of clusters equal to (P, Q), and with unequal expected row cluster sizes, we used the following procedure. First, randomly generate row and column partition matrices z and K. Specifically, an I × P row partition matrix z with unequal row cluster sizes (in expectation) is generated by randomly assigning each row to a row cluster R p with probability ω p , where ω 1 = 0.7 and ω p = (1 − ω 1 )/(P − 1), (∀ p = 1). This means one may expect one large row cluster that includes 70% of the rows, while the remaining 30% of the rows are distributed evenly across the remaining row clusters. Since this is true only in expectation, it is possible that this step yields empty row clusters, in which case it is repeated until a z without empty clusters is generated. We further study the case of equal row cluster sizes (in expectation), where the I × P row partition matrix z is generated by randomly assigning each row to a row cluster R p with probability ω p = 1/P, (∀ p), and this step is, if necessary, repeated until a z without empty clusters is generated. Likewise, under each scenario, a J × Q column partition matrix K is generated by randomly assigning each column to a column cluster C q with probability 1/Q (if necessary repeated until a K without empty clusters is obtained).             Fig. 6 Mean N SE as a function of method (x-axis) and size of the data (curves), for data generated with unequal expected row cluster sizes. Column subfigures refer to true number of clusters set to (2, 2), (3, 3) and (4, 4), for the first, second and third column, respectively. Subfigures in the first row are obtained when the number of clusters for the analysis coincides with the true number of clusters (i.e., no misspecification), while the second and third row correspond to misspecification of the number of clusters for the analysis (see subfigure headings for details) Next, each level of true number of clusters, we constructed a fixed P × Q matrix of true interaction parameter values γ pq as where I ω p ( p = 1, . . . , P) denotes the expected cluster cardinality of row cluster R p ( p = 1, . . . , P) and |C q | (q = 1, . . . , Q) denotes the cluster cardinality of column cluster C q (q = 1, . . . , Q) as implied by the generated column partition matrix K. Therefore, each of these s guarantees that the constraints are met, that is Subsequently, an I × J matrix T, with elements t i j (i = 1, . . . , I , j = 1, . . . , J ), of true interaction effects is obtained as T = z K T and each element d i j of the observed data matrix D is generated as D i j ∼ N (t i j , σ 2 ). The value of σ 2 is chosen to obtain a specific effect size η 2 that is defined as the ratio of interaction variance to the sum of interaction variance and error variance. Specifically, which was set to η 2 = 0.10. For each combination of size, equality of expected row cluster sizes and true number of clusters we generated 1000 simulated data sets. Each data set D was then analyzed three times using the newly proposed E-ReMI method, each of which setting the number of clusters for the analysis equal to a specific value of (P * , Q * ) (for details, see Figs. 1-6). This yielded for each data set and each value of (P * , Q * ) an observed value of the test statistic λ (obs) L R , which, as in the second simulation study, was compared to the corresponding critical value λ (α) L R for that combination of size and number of clusters (P * , Q * ). The proportion of observed values (out of all 1000 data sets for that combination) that fall in the rejection region corresponds to the empirical power of the E-ReMI test of interaction.
Furthermore, to study cluster recovery performance of E-ReMI, we examined the extent to which the optimal partitions, as obtained from the data analyses, resemble the true partitions underlying the data. Specifically, we measured the agreement between the true underlying partition of the set of rows (z) and the estimated row partition (ẑ) making use of the adjusted Rand index (AR I ), see (Hubert & Arabie (1985)). This index is 1 if the two partitions are identical and 0 if the two partitions do not correspond more than expected by chance, and its minimal value can be smaller than 0 (Chacón & Rastrojo, 2022). Furthermore, the index is insensitive to permutations of the cluster labels. We measured the agreement between the true and estimated column partitions (i.e., K andK) in the same way. Finally, we examined the extent to which the true interaction parameters are recovered by computing the normalized squared error (N SE): Note that N SE is a decreasing function of both cluster recovery and the extent to which the estimated interaction effect parametersγ pq resemble the true interaction parameters γ pq . Larger values of N SE indicate stronger disagreement. For the purpose of method comparison, each data set was also fitted by a model that assumes equal expected row cluster sizes. This was achieved by using the constrained criterion (CC) * , as explained in Section 2.3. From a modeling perspective, this is equivalent to fitting the REMAXINT model discussed in Ahmed et al. (2021). In order to test the hypothesis of no interaction based on this model, we used the constrained test statistic λ * L R , as explained in Section 3.1. Furthermore, we tested each simulated data set for the presence of interaction by the methods developed in Boik (1993); Malik et al. (2016); Piepho (1994), using the released R-package CombinIT (Shenaravi & Kharrati-Kopaei, 2018). This package includes various tests for the presence of interaction in two-mode data. We selected those that are (computationally) able to handle the data sizes included in this simulation study. Boik is based on a test statistic that involves the singular values of the (scaled) nonadditivity matrix (i.e., the matrix of least squares residuals from fitting a two-way additive model to D). Malik partitions that set of residuals (non-additivity values) into three clusters and is based on the idea that in the case of interaction one may expect to see at least one cluster of residuals that are positive, one cluster of residuals that are negative, and possibly one cluster with residuals close to zero. The test statistic is an F-like test statistic for a two-way model that includes row and column main effects and a cluster effect. Piepho tests for interaction by checking equality of variances between rows. This is based on the fact that if there is no interaction, then the expected row variances are all equal. For further details the reader is referred to the reference manual of combinIT on the CRAN website and Shenaravi & Kharrati-Kopaei (2018).
We study empirical power for E-REMI, REMAXINT, Boik, Malik and Piepho. For the estimated solutions obtained by REMAXINT, we also computed the three recovery measures (i.e., AR I for rows, AR I for columns and N SE) and compared these results to those obtained by E-ReMI. A similar comparison between E-ReMI and Boik, Malik and Piepho, respectively, is not possible because these methods are not based on an underlying two-mode clustering model and CombinIT only focuses on hypothesis testing for interaction.  L R for each combination of size of the data and number of clusters, at a nominal significance level α= 0.05. These critical values were obtained by applying the three steps Monte Carlo scheme described in Section 3.2.

Critical Values
Inspection of Table 2 shows that, for each level of size, the null distribution of λ L R is shifted towards the right for increasing number of column clusters. This is an expected result because more parameters are estimated from the observed data and, thus, more chance capitalization resulting in higher λ L R values by chance. In contrast, increasing the number of row clusters appears to result in null distributions that are shifted towards the left, despite implying a larger number of estimated parameters. This shift must be attributed to the penalty term in (12), which is affected by the number of estimated row clusters. Furthermore, if the number of columns J is fixed, then for each level of number of clusters, the null distribution of λ L R is shifted to the left as the number of rows I increases. This is due to the penalty term decreasing faster, as I increases, than the difference in log-squared-residuals (see (12)), leading to smaller test statistic values. Finally, comparing 50 × 30 to 20 × 20 and 40 × 20 shows that increasing the number of columns J shifts the null distribution of λ L R towards the right, despite a larger number of rows I , which we have seen shifts the null distribution to the left. This may be explained by the fact that J does not affect the penalty term in (12) but only the difference in log-squared-residuals. Table 3 shows the proportion of significant test results for all combinations of size and number of clusters. Inspection of this table reveals that the proportion of significant test results for each design is close to the nominal significance level α = 0.05. Specifically, only one of the empirical Type-I error rates is outside the interval [0.0403; 0.0597], which is the interval centered at the nominal significance level (i.e., 0.05) and with radius equal to the (normal approximation of the) margin of error for a population proportion of 0.05, with L = 5000 trials, and corrected for the number of tests, i.e. cells in the table, using Bonferroni for familywise error rate of 5%. The accuracy of the empirical Type-I error rates suggests that generating L = 5000 Monte Carlo data sets using the three steps Monte Carlo scheme described in Section 3.2 is a reasonable choice for approximating the null distribution of λ L R .

Power and Parameter Recovery
Power Figure 1 shows empirical power as a function of method (x-axis) and size of the data (curves), for data generated with unequal expected row cluster sizes. Subfigures in each column refer to true number of clusters set to (2, 2), (3, 3) and (4, 4), for the first, second and third column, respectively. Subfigures in the first row are obtained when the number of clusters for the analysis coincides with the true number of clusters (i.e., no misspecification), while the second and third row correspond to misspecification of the number of clusters for the analysis (see subfigure headings for details). Note that the methods Boik, Piepho and Malik are not based on a two-mode clustering model, and hence their power depends only on the generated data and not on the number of clusters for the analysis. The results of these methods therefore do not change across the different rows of Fig. 1, but only across columns. Overall, empirical power decreases if the number of observations (i.e., I and/or J ) decreases (comparison between curves within a subfigure) and/or the true number of clusters (P, Q) increases (comparison across subfigures of the first row). There is a decrease, but it is not dramatic, in power for E-ReMI and REMAXINT when the number of clusters for the analysis does not match the true number of clusters (comparison across rows of the subfigures within each column). This suggests that, when using E-ReMI or REMAXINT as a test for interaction a small under/over-fitting does not have serious consequences for power (see Section 6 for a discussion on setting the number of clusters for the analysis). Comparing the different methods, it stands out that Malik performs the worst, followed by Piepho, which shows the second worst performance in terms of power. Comparing the remaining three methods when the true number of clusters is set to (2, 2), REMAXINT seems to perform slightly better than E-ReMI, which, in turn, tends to perform better than Boik. Instead, when the true number of clusters is set to (3, 3) E-ReMI overall has the best performance, followed by REMAXINT and then Boik. Given the choice of data generation mechanism (with one row cluster comprising 70% of cases in expectation), this is not surprising as increasing the number of row clusters leads to a higher level of inequality of the expected row cluster sizes. Lastly, Boik becomes the best method, followed by E-ReMI, once the true number of clusters increases to (4, 4). Figure 2 is similar to Fig. 1 and shows empirical power as a function of method (x-axis) and size of the data (curves), but for data generated with equal expected row cluster sizes (same subfigure structure). Similarly as in the previous scenario, empirical power decreases if the number of observations decreases (comparison between curves within a subfigure) and/or the true number of clusters (P, Q) increases (comparison across subfigures of the first row). However in this case, it seems that increasing the number of columns has a stronger effect than increasing the number of rows, as, when size is set to 30 × 50 results are equal or better than when size is set to 50 × 30. A possible explanation is that since E-ReMI is too flexible, it requires more columns so as not to overfit the data with respect to the row clusters. REMAXINT, on the other hand, correctly assumes equal row cluster sizes and thus is less affected by the number of columns. As in the unequal row cluster sizes case, there is a small decrease in power for E-ReMI and REMAXINT when the number of clusters for the analysis does not match the true number of clusters (comparison across rows of the subfigures within each column), with E-ReMI being less robust to this misspecification. Comparing the different methods, Malik and Piepho have clearly the worst performance. Comparing the remaining three methods reveals that when the true number of clusters is set to (2, 2) and to (3, 3) REMAXINT seems to perform better than E-ReMI and Boik, which have a similar performance. When increasing the true number of clusters to (4, 4) Boik is clearly the best performing method in terms of power.
Parameter recovery Figures 3-6 present the results in terms of means, across all 1000 data sets per condition, of AR I for rows, AR I for columns and N SE. Similarly to the figures for power, different columns refer to different true number of clusters, while different rows to different number of clusters for the analysis (see subfigure headings for details). For studying parameter recovery, the comparison is possible only between E-ReMI and REMAXINT, as they are the only two methods that yield estimated row and column partitions with corresponding bicluster interaction effect parameters. Figure 3 presents the means across all 1000 data sets per condition of AR I for row clusters as a function of method (x-axis) and size of the data (curves), for data generated with unequal expected row cluster sizes. Overall, mean AR I is higher (i.e., better performance) for larger data sizes. Specifically, it is the highest when size of the data is set to 200 × 30, followed by 30 × 50 and then by 50 × 30. Moreover, for fixed number of columns and increasing number of rows (i.e., comparing 100 × 20, 40 × 20 and 20 × 20), the performance increases as the number of rows increases. Lastly, comparing the cases 100 × 20, 50 × 30 and 30 × 50, the latter has the overall best performance, while 100 × 20 has the worst, despite having a larger number of total observations (i.e. 2000 as compared to 1500). This is as expected, since a larger number of columns implies more information to estimate the row clusters. Subfigures in the top row show the results when there is no missspecification of the number of clusters for the analysis, that is, when they coincide with the true number of clusters. Focusing on these figures, it can be seen that the two methods perform equally well in the scenario with true number of clusters set to (2, 2), while E-ReMI performs better in the (3, 3) and (4, 4) cases. In case of misspecification of the number of clusters for the analysis, E-ReMI performs always better than REMAXINT. This result is particularly interesting in the (2, 2) case, as the performance between the two methods was similar under correct specification of the number of clusters. This increased comparative performance can be explained by the fact that the misspecified models in this case imply an overfitting of the number of row clusters. It is likely that E-ReMI is capable of classifying correctly most of the observations, by creating additional small clusters for observations that (randomly) differentiate from the two main clusters. REMAXINT, on the other hand, because of the implicit assumption of equal row cluster sizes, is encouraged more strongly to yield surplus row clusters containing a substantial number of rows. Figure 4 presents the means of AR I for row clusters, for data generated with equal expected row cluster sizes. Overall, we see a slightly better performance of REMAXINT in all scenarios but those where the true number of clusters is equal to (2, 2) and the number of clusters for the analysis are misspecified, where E-ReMI has clearly a better performance. These results can again be explained by the flexibility of E-ReMI with respect to yielding row clusters with unequal row cluster sizes that may (partially) make up for the fact that an excess number of row clusters is fitted to these data. Figure 5 presents the means of AR I for column clusters, for data generated with unequal expected row cluster sizes. Overall, mean AR I is higher (i.e., better performance) for larger data sizes. Specifically, it is the highest when size of the data is set to 200 × 30, followed by 100 × 20 and then by 50 × 30. As opposed to AR I for rows, increasing the number of rows has a stronger effect on AR I for columns than increasing the number of columns. This can be seen by 50 × 30 performing better than 30 × 50 (it was the opposite in ARI for rows) and by 100 × 20 performing better than those two (which was not the case for ARI for rows). This is as expected, since a larger number of rows imply more information to estimate the column clusters. The performance of the methods decreases for higher values of true number of clusters (comparison across columns), and it is negatively affected by misspecification of the number of clusters for the analysis when true number of clusters is set to (3, 3) and (4, 4) (comparison across rows of the middle and rightmost columns). Note that when true number of clusters is set to (2, 2), both misspecifications imply overfitting in terms of the number of row clusters, whereas when true number of clusters is set to (4, 4) both misspecifications imply underfitting in terms of the number of column clusters. When true number of clusters is set to (2, 2) the two methods perform equally well, while when it is set to (3, 3) or (4, 4), the two methods perform equally well in most cases, except for data sizes with a large number of rows in which E-ReMI has a better performance. This suggests that estimation of the column partitions benefits substantially from a correct specification of the model if that sample is sufficiently large (i.e., 100 rows or more). Results when data are generated with equal row cluster sizes are very similar, but now the two methods perform always equally well (results not shown). Note that in this case, the sampling mechanism for the rows as assumed by REMAXINT is correct. Figure 6 presents the means of N SE, for data generated with unequal expected row cluster sizes. The outcome measure N SE is the most general parameter recovery performance measure out of the three considered in this study, since it takes into account the quality of the estimated row clustering, of the estimated column clustering and of the estimated interaction effect parameters. Bearing in mind that lower values of N SE imply better performance, mean N SE is the lowest (and thus the best) for size set to 200 × 30, i.e. the largest data size, and is very similar for sizes set to 100 × 20, 50 × 30 and 30 × 50. Since 100 × 20 has a larger data size than the other two cases, but it is also the most asymmetrical case, this suggests that more symmetrical data sets tend to perform better in terms of N SE. Also in this case, increasing the true number of clusters has a detrimental effect on the performance of the methods under study. Interestingly, misspecification of the number of clusters for the analysis does not lead to a clear trend with results that are sometimes not affected, sometimes positively affected (better performance of the methods) and sometimes negatively affected (worse performance of the methods). Lastly, E-ReMI tends to perform at least as well as REMAXINT, and, very often, better. Results when data are generated with equal row cluster sizes are very similar, but the two methods perform always equally well (results not shown).
Summarizing the results in this subsection, increasing size of the data leads to a better performance, with some performance criteria more affected by an increase in the number of rows and some others by an increase in the number of columns. Increasing the complexity of the clustering structure, that is, when true number of clusters increases, leads to a worse performance. Misspecification of the number of clusters for the analysis generally has a detrimental effect on the performance of the methods, but for most performance criteria (in most scenarios) this effect is minimal. As could be expected, in terms of power, E-ReMI tends to perform better than REMAXINT when data are generated with unequal expected row cluster sizes whereas, in a few cases, the performance of REMAXINT is better than that of E-ReMI when data are generated with equal expected row cluster sizes. Overall, in terms of parameter recovery, E-ReMI performs better than REMAXINT when data are generated with unequal expected row cluster sizes and both methods perform equally well when data are generated with equal expected row cluster sizes. Lastly, when it comes to power, Boik's method is a good choice as its performance is always very good and it is more robust to increased complexity of the data. However, this method was not designed to facilitate interpreting the row by column interaction in a data set at hand. When interest is in understanding that interaction, REMAXINT/E-ReMI are recommended because they yield an estimated two-mode clustering structure and corresponding interaction effect parameter estimates.

Application to a Study of Altruism Behavior
In this section, we analyze data from a real case study on altruism in order to infer whether there is statistical evidence of interaction and to study what it looks like. The application stems from a study of person by situation interaction, one of the key questions addressed by researchers in contextualized personality psychology (Geiser et al. 2015;Mischel & Shoda 1995, 1998. The data in question were collected in a study by Quintiens (1999) and were more recently reanalyzed in Schepers & Van Mechelen (2011), Schepers et al. (2017 and Ahmed et al. (2021). A group of I = 102 participants was presented with a set of J = 16 hypothetical situations, each describing an emergency situation in which a victim could possibly be helped. Each participant was asked to indicate, for each situation, to what degree they would be willing to help the victim. Ratings were given on a 7-point scale from 1 (definitely not) to 7 (definitely yes).
In order to infer whether there is evidence of interaction between person and situation clusters, E-ReMI and REMAXINT interaction tests were both applied to the 102 × 16 data matrix of help ratings. For REMAXINT, this implied analyzing the data set at hand by maximizing constrained criterion (CC) * , see Section 2.3, and testing the hypothesis based on the constrained test statistic λ * L R , see Section 3.1. For both methods, we analyzed the data assuming (P, Q) = (3, 3), since Ahmed et al. (2021) suggested this choice for REMAXINT, using a post-hoc analysis. In order to obtain critical values for λ L R and λ * L R , we employed the procedure described in Section 3.2 for a significance level α = 0.05, and using Algorithm 1. Each analysis yields an observed value λ  Table 4. For both methods, the observed value of the test statistic falls in the rejection region. In fact, for both tests, the observed test statistic value is larger than any of the simulated values obtained under the null hypothesis, implying empirical p-values that are equal to zero. Thus, based on the E-ReMI and REMAXINT interaction tests assuming (P, Q) = (3, 3), there is strong empirical evidence to conclude that there is an interaction between person cluster and situation cluster on willingness to help.
We now turn to studying what the gist of the interaction in these data looks like. For this purpose, Fig. 7 shows bicluster means associated to the estimated person and situation partitions yielded by E-ReMI and REMAXINT (middle and right panel, respectively), assuming P = 3 person clusters and Q = 3 situation clusters. Furthermore, this figure also shows the bicluster means associated to the estimated person and situation partitions yielded by a Gaussian latent block mixture model (Govaert & Nadif, 2003) assuming the same number of person and situation clusters (left panel). The block mixture model parameter estimates can be seen as a summary of the data. Hence, if there is substantial interaction in some data set, one may expect to see it exhibited in a suitable summary of those data. The latent block mixture model was estimated using the R package blockcluster (Bhatia et al., 2017) allowing unequal person and situation cluster sizes and assuming homogeneous residual variance across biclusters.
Compared to the latent block mixture model, REMAXINT and E-ReMI yield person and situation clusters that represent a stronger degree of person cluster by situation cluster interaction. This difference implies a substantially different type of conclusion. Notably, for the latent block mixture model, a ranking of the person clusters in terms of average willingness to help is consistent across all three situation clusters. This implies that members of one person cluster are on average more willing to help than members of another person cluster, regardless of the situation. In contrast, for REMAXINT and E-ReMI that ranking of the person clusters depends on the situation cluster. Specifically, the latter two methods yield person clusters with members that, in some situations, are on average more willing to help than members of another person cluster, but not in other situations. This difference between, on the one hand, the solution yielded by the latent block mixture model and, on the other the hand, the solutions yielded by REMAXINT and E-ReMI is due to the presence of individual person and situation main effects in these data. The latent block mixture model yields person and situation clusters that capture person by situation interaction as well as person and situation main effects. Remember that the latent block mixture model does not allow for row differences within a given row cluster, and likewise not for column differences within a given column cluster. Stated differently, it assumes stochastic independence between observations within the same row and column cluster. In contrast, REMAXINT and E-ReMI assume local independence given R and C and the individual person and situation main effects, implying that person and situation clusters as yielded by these methods are not affected by those main effects. It is important to note that this does not imply that the REMAXINT and E-ReMI person clusters (resp. situation clusters) are necessarily independent of the person (resp. situation) main effects in the data. For instance, it can be seen in Fig. 7 that the situation clusters yielded by REMAXINT differ in terms of average willingness to help. The same applies to the situation clusters yielded by E-ReMI. However, this association simply happens to be a characteristic of these data and may be (almost) absent in other applications.
Choosing between a latent block mixture approach or a maximal interaction clustering approach must be based on the type of question that one wishes to address with the chosen method. A latent block mixture approach will yield parameter estimates that are a good summary of the data, and will thus yield row and column partitions that also capture row and column main effects. A maximal interaction clustering approach is more useful if one wishes to describe the gist of the row by column interaction. As the nature of person by situation interaction is of central interest in contextualized personality psychology, and not so much the possible person and situation main effects, one may argue that a maximal interaction clustering approach is the more interesting choice for this application. Using a latent block mixture approach may, compared to a maximal interaction clustering approach, require the extraction of much more row and column clusters to capture the observed correlations between columns and rows, respectively. Furthermore, compared to E-ReMI, REMAXINT tends towards yielding person clusters that are more equal in size. According to E-ReMI, there is one large cluster and two much smaller ones whereas the relative cluster cardinalities of the row clusters yielded by REMAXINT are closer to each other. A likelihood ratio test suggests that E-ReMI provides a better fit to these data than REMAXINT (−2( (ξ 0 )− (ξ )) = 43.784, d f = 2, p = 0.000). Note that the person cluster by situation cluster interaction captured by E-ReMI is structurally different from the one captured by REMAXINT. For REMAXINT, a ranking of the situation clusters in terms of average willingness to help is equal across all three person clusters. This implies that, consistently across person clusters, situations of one situation cluster elicit on average more willingness to help than situations of another situation cluster. In contrast, for E-ReMI this ranking of the situation clusters depends on the person cluster: Situations of situation cluster 2 elicit on average more willingness to help than situations of situation cluster 1 for members of two of the three person clusters, but the opposite holds for members of the other person cluster. In contextualized personality psychology, this difference in structure is of theoretical importance, as the consistent ranking of situation clusters as yielded by REMAX-INT suggests the possible existence of a latent (one-dimensional) force that stems from the situations and that plays a key role in determining the behavior of all persons (Van Mechelen, 2009). In this application, an example of that force could be the level of compassion as induced by a situation. However, since E-ReMI yields a structure that does not show a consistent ranking of the situation clusters, and fits the altruism data better than REMAXINT, it appears that willingness to help cannot be explained by such a one-dimensional underlying situational force.

Discussion
In this paper, we presented E-ReMI, a method based on a probabilistic two-mode clustering model that yields two-mode partitions of the data with maximal interaction between row and column clusters. Specifically, this paper extends existing work (Ahmed et al., 2021), in several ways. First, in the specification of the model, we relaxed the assumption of equal cluster size on the random rows, thus allowing for unequal cluster size of the row clusters. Moreover, we introduced a new testing procedure for the null hypothesis of no interaction. This includes a test statistic, its properties and an algorithm to obtain the distribution of the test statistic under the null hypothesis. Finally, we developed software for all the methods presented in this paper (i.e., estimating E-ReMI and REMAXINT models and Monte Carlo sampling for hypothesis testing). In order to make these methods available to a large group of users, we implemented our codes in the free software R. In summary, using the proposed method, users will be able to test the presence of interaction between rows and columns. They will also obtain a two-mode partition of the data set based on maximal interaction, as well as estimates of the model parameters, i.e. interaction effects, row cluster weights and main effects.
We assessed the performance of the proposed method by means of a simulation study and a real-life application. We further compared the proposed method with other competing methods, wherever possible. Specifically, in the simulation studies, we studied the performance of E-ReMI in terms of Type-I error rate, power and parameter/partition recovery. Empirical Type-I error rates showed good performance, as their deviation from the nominal significance level was not more than what is expected by chance. In terms of power, as predictable, we observed lower power of the methods considered here when size of the data decreases or the number of row/column clusters increases. Misspecification of the number of clusters for the analysis generally has a detrimental effect on the performance of the methods, but for most performance criteria (in most scenarios) this effect is minimal. Boik's method has always a good performance and it is more robust to increased complexity of the data as compared to E-ReMI and REMAXINT, however it is not designed to facilitate interpreting the row by column interaction found. E-ReMI tends to perform better than REMAXINT under unequal expected row cluster sizes, whereas in a few cases REMAXINT performs better than E-ReMI under equal expected row cluster sizes. As for parameter/partition recovery E-ReMI tends to perform better than REMAXINT under unequal expected row cluster sizes and both methods perform equally well when data are generated with equal expected row cluster sizes. In an analysis of a real data set on altruism, we found strong empirical evidence for person by situation interaction based on REMAXINT and E-ReMI hypothesis testing. For these data, we further discussed the parameter estimates yielded by REMAXINT, E-ReMI and a Gaussian latent block mixture model and highlighted the different underlying structures that are uncovered by the methods, including their implications for important substantive research questions in contextualized personality psychology.
Several extensions of the work reported in this paper are possible. Firstly, a limitation of the current approach is the assumption of normality on the residual terms i j of the probabilistic model of E-ReMI. It is not clear how the method behaves in case of a violation of this assumption, which may be likely in applications. Therefore, it would be interesting to perform a study on the robustness of E-ReMI to violations of the normality assumption. This study is beyond the scope of this paper and it is currently under investigation. Secondly, just like REMAXINT, the newly proposed method E-ReMI requires that the number of row and column clusters are fixed a priori. Ahmed et al. (2021) proposed a post analysis approach to select optimal values for these parameters, but other approaches are certainly possible. For example, this can be done developing a method where P and Q are parameters to be estimated (see e.g., (Miller & Harrison, 2018)). Thirdly, it would be interesting to generalize the model such that it allows to perform two-mode maximal interaction clustering for binary or count response data. This can be done using a logistic and Poisson regression framework, respectively, but may necessitate using a marginal likelihood approach with a distributional assumption on the random row effects α i instead of a conditional likelihood approach. Finally, for some applications it may be more suitable to fit a model with a random effects assumption for the columns too (and assuming a latent random Q-partition of the column set). It is to be studied whether a conditional likelihood approach that conditions on sufficient statistics for the row main effects as well as sufficient statistics for the column main effects, is an appropriate choice as a method for estimating the model parameters of interest.
Data Availability R codes for generating simulated data as described in Section 4 are available in DataverseNL at the following url https://doi.org/10.34894/PWQHEC. We do not own the rights for the person by situation data used in Section 5.
Code Availability R codes for analyzing data as described in Sections 2 and 3 are available in DataverseNL at the following url https://doi.org/10.34894/PWQHEC.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A: Cross-product Term U
In this appendix, we show how to derive (5). The cross product term U in (4) can be rewritten as The second line is obtained by replacing B i j by its definition, i.e. B i j = d i j −d i· − β j − γ pq , see Table 1, and by taking the elements that do not depend on j out of its summation. The third Next we apply the Lagrange multiplier method to find the estimate of β j and γ pq . Focusing on the terms that depend on β j and γ pq the Lagrangian function is written as where λ β , λ γ q and λ γ p are the Lagrange multipliers associated to the equality constraints on β j and γ pq . Focusing on finding the maximum of the β j ( j = 1, . . . , J ) parameters, the first derivative is: Equating this equation to 0, and solving for β j leads to the following result, where the second line is obtained by applying the sum operators to each term within the brackets and multiplying each side of the equation by σ 2 . The third line is obtained by noting that P p=1 I i=1 z i p d i j = Id . j , P p=1 I i=1 z i pdi· = Id .. , P p=1 I i=1 z i p = I , I i=1 z i p = |R p | (for a specific p), and by solving for β j . In order to find the Lagrange multiplier λ β , the next step is to apply a sum operator over j to both sides of the equations leading to with the second line resulting from J j=1 β j = 0 (constraint) and J j=1d . j − J j=1d .. = 0, and the final result obtained by solving for λ β . This result, together with the equation obtained for β j , leads toβ j =d . j −d .. , since, after replacing λ β in the derivative, the last two terms cancel out.
The estimate of γ pq can be obtained similarly. First, we compute the first derivative of (15) w.r.t. γ pq Equating this equation to zero, and solving for γ pq leads to the following result, where in the first line β j is replaced by its estimate, and the second line is obtained by solving for γ pq and noting that I i=1 J j=1 z i p k jq γ pq = γ pq I i=1 z i p J j=1 k jq and γ pq I i=1 z i p J j=1 k jq = γ pq |R p ||C q |. In order to find the Lagrange multiplier λ γ q , the next step is to apply a sum operator over p to both sides of the equations leading to λ γ q = 0, with the first equality obtained by replacing λ γ p with − |R p | J Q q=1 λ γ q and the second noting that P p=1 |R p | = I .