Maximal Interaction Two-Mode Clustering

Abstract: Most classical approaches for two-mode clustering of a data matrix are designed to attain homogeneous row by column clusters (blocks, biclusters), that is, biclusters with a small variation of data values within the blocks. In contrast, this article deals with methods that look for a biclustering with a large interaction between row and column clusters. Thereby an aggregated, condensed representation of the existing interaction structure is obtained, together with corresponding row and column clusters, which both allow a parsimonious visualization and interpretation. In this paper we provide a statistical justification, in terms of a probabilistic model, for a two-mode interaction clustering criterion that has been proposed by Bock (1980). Furthermore, we show that maximization of this criterion is equivalent to minimizing the classical least-squares two-mode partitioning criterion for the double-centered version of the data matrix. The latter implies that the interaction clustering criterion can be optimized by applying classical two-mode partitioning algorithms. We illustrate the usefulness of our approach for the case of an empirical data set from personality psychology and we compare this method with other biclustering approaches where interactions play a role.


Introduction
Observed data can often be represented in the form of an $I$ by $J$ data matrix $D = (d_{i,j})$. We consider the case in which the rows and columns of the data matrix constitute the levels of two categorical variables $X$ and $Y$, say, with $I$ and $J$ categories, respectively, and the cell entries denote the observed values of a single quantitative dependent variable $d$. In the terminology introduced by Carroll and Arabie (1980), this type of data is called two-way two-mode. Such data matrices are collected, for instance, in contextualized personality research, when a set of $I$ individuals (labeled by $i = 1, \ldots, I$) is measured on some behavior of interest $d$ in $J$ different situations (labeled by $j = 1, \ldots, J$). Other examples can be found in the study of micro-array data in genome research where DNA expression levels $d_{i,j}$ are obtained for $I$ genes under $J$ different conditions, or in agricultural studies where, for instance, crop yield per hectare is recorded for crops of $I$ different genotypes and at $J$ different locations.
A major challenge in these and other fields of science is to capture the dominant structural information contained in data matrices. Typically, however, the size $I \times J$ of the data matrix $D$ is large, by which we mean too large to succinctly describe and interpret the full information contained in the data at hand, so that understanding the overall structure of this information is difficult. A useful way to resolve this problem is to simultaneously cluster the rows and columns of the matrix $D$ in $P$ row clusters and $Q$ column clusters, respectively, such that the $I \times J$ data matrix is partitioned into $P \cdot Q$ biclusters (blocks) and its structural information is represented as much as possible in a condensed, parsimonious way by $P \cdot Q$ block-specific values.
Many two-mode clustering methods have been developed (for an overview, see Van Mechelen, Bock, and De Boeck, 2004; Govaert and Nadif, 2013) and, for a given data set and a substantive research question at hand, some of these are more suitable than others. We will focus on a two-mode clustering criterion first proposed by Bock (1980), which explicitly addresses the row by column interaction in $D$ (see formula (2) below). After Bock (1980), various two-mode clustering methods were proposed that use interaction concepts. However, most of these methods look for (possibly overlapping) blocks in the data matrix with a minimal within-block row by column interaction rather than for a representation of the full row by column interaction in the data matrix $D$. (These methods will be commented on in Section 6.1.) The criterion proposed by Bock (1980), which we will further call the maximal interaction two-mode clustering criterion, implies simultaneously looking for a partition of the row set $\mathcal{X} = \{1, \ldots, I\}$ and a partition of the column set $\mathcal{Y} = \{1, \ldots, J\}$ which are such that the implied interaction among the row clusters and the column clusters is maximal, on the average.
In this paper, we address two issues pertaining to maximal interaction clustering that have not been addressed so far: first, we develop a statistical justification for the criterion proposed by Bock and, second, we show how the criterion can be optimized numerically. Concerning the first aspect, we will describe a probabilistic model from which the maximal interaction two-mode clustering criterion results when using a classification likelihood approach for model estimation (Section 3). This is useful for understanding the conditions under which the proposed interaction criterion is likely to be successful (Banfield and Raftery, 1993; Bock, 1996) and helps clarify its relation to other clustering methods. Concerning the second aspect, we will show (Section 4) that optimizing the criterion in question is equivalent to minimizing a standard least-squares two-mode partitioning criterion for a suitably transformed version of the data matrix. The latter result implies that the optimal solution for maximal interaction two-mode clustering can be obtained by means of existing classical two-mode partitioning algorithms, some of which are available as free software and some of which have been tested extensively with regard to numerical performance (e.g., Van Rosmalen et al., 2009). In addition, we will apply the maximal interaction clustering approach to an empirical data set from personality research in psychology and show how it can indeed capture the gist of the interaction pattern of a data matrix under study (Section 5).
The remainder of this paper is organized as follows. In the next section, the maximal interaction two-mode clustering criterion is explained. Subsequently, in Section 3, a statistical justification is given for this criterion. Next, in Section 4, it is shown that maximal interaction two-mode clustering is equivalent to classical least-squares two-mode partitioning when applied to a suitable transformation of the data. Section 5 presents the data example and Section 6 provides a general discussion of our approach together with a detailed comparison to other interaction-based biclustering methods (Section 6.1). The paper ends with some concluding remarks.

Motivating Research Problems
Although it is not possible to find a single clustering criterion that is uniformly better than all other possible criteria, one criterion can be better than another one for a specific research question. For instance, when performing a biclustering in the usual way (with a deterministic method such as double k-means, or with a stochastic model such as some two-mode mixture model, which may be estimated in a frequentist or a Bayesian way) the resulting biclustering will be dominated by the row and column main effects. This results from the fact that these methods essentially rely on a single clustering structure to capture both the main effects and the row by column interaction (see Section 3.2). However, in many applications the focus of interest of the researchers is not so much on the main effects of the rows and columns, but more on the interaction between them (see also the references in Section 6.1). As we will explain in Section 2.2, maximal interaction clustering implies a simultaneous clustering of the rows and columns of the matrix $D$ in $P$ row clusters and $Q$ column clusters, respectively, such that the row by column interaction pattern is highlighted as much as possible in a parsimonious way by $P \cdot Q$ block-specific interaction values. This amounts to assuming that all interaction terms within the same bicluster are equal (whereas the main effects should play no role, see Section 3.1). In the remainder of this section we describe some applications for which maximal interaction clustering is well suited.
In contextualized personality psychology, a critical challenge is to capture person by situation interactions (Mischel and Shoda, 1995, 1998; Geiser et al., 2015). Indeed, a key question addressed by researchers in this field is whether the situation effect is the same for all individuals and, if not, what the structure of the person by situation interaction looks like. Furthermore, contextualized personality psychologists are typically less interested in situation main effects. The latter are considered to be part of general psychology, are often trivial and correspond to common sense. For instance, it should not come as a surprise to find that people respond more aggressively in situations that are more frustrating. Contextualized personality psychologists are, however, primarily interested in the shape of the individual-specific behavioral signature (i.e., the response pattern across different situations), in which the global level of this profile (i.e., the subject main effect) is less important or even of no interest at all. The shape of the signature is considered to be an important characteristic of an individual and contextualized personality psychologists advocate that it constitutes an essential part of the study of personality (Shoda et al., 2013, 2015). For example, individuals might be characterized by different sensitivities to specific types of frustration, such as responding aggressively as a result of being let down by others versus as a result of being narcissistically offended.
Another domain of application is agriculture where researchers obtain, from suitable field experiments, large genotype by environment data matrices on, for instance, crop yield, with the research focus typically being on the genotype by environment interaction (G×E interaction: Corsten and Denis, 1990; Piepho, 1997, 1999), rather than on the genotype and environment main effects. For instance, the amount of rainfall may differ between locations and for plant breeders it is important to know whether, in terms of crop yield, genotypes are differentially sensitive to these variations across locations. If so, this "seriously limits efforts in selecting superior genotypes for both new crop introduction and improved cultivar development" (Shafii and Price, 1998). Moreover, it is then important to understand this G×E interaction pattern in order to make region-specific recommendations with regard to choosing genotypes and/or selecting locations for optimal crop yield. State-of-the-art methods for analyzing data from G×E studies include AMMI (additive main effects and multiplicative interaction effects) models (Gollob, 1968; Gauch, 2006; Gauch, Piepho, and Annicchiarico, 2008; Forkman and Piepho, 2014), which assume individual row and column main effects but yield a parsimonious representation of the row by column interaction structure by decomposing the latter into a small set of principal components. The idea is that the row (genotype/plant) and column (location) main effects should not be taken into account when looking for a parsimonious representation of the data matrix because, likely, different genotypes show individual stress effects and individual deviations in sunshine or soil fertility may be specific to each location. Instead, AMMI models aim to characterize plant types in terms of their sensitivity to, for instance, soil type (sandy, organic, etc.) or other environmental characteristics such as altitude, wind, and so on.
Maximal interaction clustering bears a close resemblance to these methods in that it also looks for a parsimonious representation of the row by column interaction that is invariant to the magnitude of the row and column main effects. However, this parsimonious representation of the interaction structure is obtained by means of a two-mode clustering, which yields results that may be considered more easily interpretable. In this regard one may note that the usual procedure when applying AMMI models is to assume two or three components and then to use biplots (Gabriel, 1971; Gower and Hand, 1996) in order to identify 'clusters' of similar genotypes and of similar environments. In our approach, this clustering is implied by the very nature of the method, obviating the need for a two-step approach.
Obviously, apart from agriculture, gene by environment interactions are also a key topic of interest in medicine (including psychopathology and psychiatry) (see, e.g., Hunter, 2005; Caspi and Moffitt, 2006; Moffitt, Caspi, and Rutter, 2006; National Institute of Environmental Health Sciences, 2016). Other applications may be found, for example, in marketing research (consumer behavior under different advertisement strategies).

Method
Consider a data matrix $D = (d_{i,j})_{I \times J}$ where $d_{i,j}$ denotes the observed value on some criterion variable $d$ for level $i$ of a first categorical predictor variable $X$ and level $j$ of a second categorical predictor variable $Y$. Let $\mathcal{R} = \{R_1, \ldots, R_p, \ldots, R_P\}$ and $\mathcal{C} = \{C_1, \ldots, C_q, \ldots, C_Q\}$ denote partitions of the row set $\mathcal{X}$ and the column set $\mathcal{Y}$, comprising $P$ and $Q$ clusters, respectively, and let $\#R_p$ and $\#C_q$ denote the cluster cardinalities of row cluster $R_p$ and column cluster $C_q$, respectively. Furthermore, let the two-mode cluster (bicluster, block cluster) $R_p \times C_q = \{(i, j) \mid i \in R_p, j \in C_q\}$ denote the Cartesian product of row cluster $R_p$ and column cluster $C_q$. The observed amount of interaction associated with $R_p \times C_q$ can then be represented by the block interaction term

$$g_{p,q} = \bar{d}_{R_p,C_q} - \bar{d}_{R_p,\cdot} - \bar{d}_{\cdot,C_q} + \bar{d}_{\cdot,\cdot} \tag{1}$$

where here and in the following we will use the notation:

$$\bar{d}_{i,\cdot} = \frac{1}{J} \sum_{j=1}^{J} d_{i,j}, \qquad \bar{d}_{\cdot,j} = \frac{1}{I} \sum_{i=1}^{I} d_{i,j}$$

for the row and column means,

$$\bar{d}_{R_p,C_q} = \frac{1}{\#R_p \, \#C_q} \sum_{i \in R_p} \sum_{j \in C_q} d_{i,j}$$

for the mean value in block $R_p \times C_q$ (with $\bar{d}_{R_p,\cdot}$ and $\bar{d}_{\cdot,C_q}$ denoting the analogous means over all cells in the rows of $R_p$ and in the columns of $C_q$, respectively), and

$$\bar{d}_{\cdot,\cdot} = \frac{1}{I J} \sum_{i=1}^{I} \sum_{j=1}^{J} d_{i,j}$$

for the overall mean in $D$. Bock (1980) proposed to look for a two-mode partitioning (biclustering) $\mathcal{R} \times \mathcal{C} = \{R_p \times C_q;\ p = 1, \ldots, P,\ q = 1, \ldots, Q\}$, for given numbers of clusters $P$ and $Q$, that maximizes the overall interaction criterion

$$G(\mathcal{R}, \mathcal{C}) = \sum_{p=1}^{P} \sum_{q=1}^{Q} \#R_p \, \#C_q \, g_{p,q}^2 \to \max_{\mathcal{R}, \mathcal{C}}. \tag{2}$$

Two concerns are warranted with regard to criterion (2). Firstly, one may wonder to what degree this criterion can be justified in terms of a probabilistic model for the data $d_{i,j}$. In the next section, this question will be answered by showing that an insightful probabilistic ANOVA model leads to the criterion in question. Secondly, maximizing (2) over all possible combinations of row and column partitions is a challenging combinatorial optimization problem for which no analytical solution exists and for which a complete enumeration of all possible solutions is computationally infeasible unless the number of rows and columns of $D$ is very small. Therefore, in order to apply maximal interaction two-mode clustering to large data matrices, suitable approximate numerical optimization algorithms are needed. In Section 4, it will be shown that the problem of maximizing (2) is equivalent to minimizing a classical least-squares two-mode partitioning criterion for the double-centered data matrix. This fact essentially resolves the optimization problem for (2) since there exists a range of good numerical algorithms designed for classical least-squares two-mode partitioning, which makes it possible to analyze large empirical data sets by means of the maximal interaction two-mode clustering approach.
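As a minimal sketch of how the block interaction terms and the overall criterion can be evaluated for a given pair of partitions, the following pure-Python code assumes that criterion (2) is the sum of squared block interaction terms weighted by block size (which is consistent with the between-block sum of squares used in Section 4). The function names and the label encoding (integer labels 0, ..., P-1 and 0, ..., Q-1, with all clusters non-empty) are illustrative assumptions, not part of the original method description.

```python
def block_interactions(D, row_labels, col_labels, P, Q):
    """Return (g, n_r, n_c): block interaction terms g[p][q] and cluster sizes.

    Assumes that every row cluster and every column cluster is non-empty.
    """
    I, J = len(D), len(D[0])
    n_r = [row_labels.count(p) for p in range(P)]
    n_c = [col_labels.count(q) for q in range(Q)]
    # Accumulate the sum of the data values in each block R_p x C_q.
    block_sum = [[0.0] * Q for _ in range(P)]
    for i in range(I):
        for j in range(J):
            block_sum[row_labels[i]][col_labels[j]] += D[i][j]
    grand = sum(map(sum, block_sum)) / (I * J)
    row_mean = [sum(block_sum[p]) / (n_r[p] * J) for p in range(P)]
    col_mean = [sum(block_sum[p][q] for p in range(P)) / (I * n_c[q])
                for q in range(Q)]
    g = [[block_sum[p][q] / (n_r[p] * n_c[q]) - row_mean[p] - col_mean[q] + grand
          for q in range(Q)] for p in range(P)]
    return g, n_r, n_c


def interaction_criterion(D, row_labels, col_labels, P, Q):
    """Size-weighted sum of squared block interaction terms."""
    g, n_r, n_c = block_interactions(D, row_labels, col_labels, P, Q)
    return sum(n_r[p] * n_c[q] * g[p][q] ** 2
               for p in range(P) for q in range(Q))


# Toy data: a pure crossover interaction, and the same data with a
# row main effect of +10 added to the first row.
D = [[1, 0], [0, 1]]
shifted = [[11, 10], [0, 1]]
value = interaction_criterion(D, [0, 1], [0, 1], 2, 2)
value_shifted = interaction_criterion(shifted, [0, 1], [0, 1], 2, 2)
```

In this toy example both calls give the same value, which illustrates that the criterion is unaffected by adding a row main effect.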

A Probabilistic Model for Maximal Interaction Two-Mode Clustering
In this section, a statistical justification for interaction criterion (2) is given in terms of a probabilistic ANOVA model for the data. In particular, we will show in this section that maximizing the classification likelihood for this model is equivalent to maximizing interaction criterion (2).
The model in question describes a situation in which each row $i$ and each column $j$ has its specific additive main effect ($\alpha_i$ and $\beta_j$, respectively), but the interaction terms are the same for all $(i, j)$ within block $R_p \times C_q$:

$$d_{i,j} = \mu + \alpha_i + \beta_j + \gamma_{p,q} + \varepsilon_{i,j}, \qquad i \in R_p,\ j \in C_q,\ p = 1, \ldots, P,\ q = 1, \ldots, Q. \tag{3}$$
Here the errors $\varepsilon_{i,j}$ are taken to be independent and normally distributed with mean zero and variance $\sigma^2$, and the parameters are subject to the identification constraints

$$\sum_{i=1}^{I} \alpha_i = 0, \tag{4a}$$
$$\sum_{j=1}^{J} \beta_j = 0, \tag{4b}$$
$$\sum_{q=1}^{Q} \#C_q \, \gamma_{p,q} = 0 \quad \text{for } p = 1, \ldots, P, \tag{4c}$$
$$\sum_{p=1}^{P} \#R_p \, \gamma_{p,q} = 0 \quad \text{for } q = 1, \ldots, Q. \tag{4d}$$
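For a given two-mode partitioning $\mathcal{R} \times \mathcal{C}$, maximum likelihood (m.l.) estimation of the unknown parameters $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{p,q}$ amounts to minimizing the quadratic criterion

$$S = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{i \in R_p} \sum_{j \in C_q} (d_{i,j} - \mu - \alpha_i - \beta_j - \gamma_{p,q})^2 \tag{6}$$

subject to identification constraints (4a-4d). This is obvious for the case of a known variance $\sigma^2$, but also holds for the case of an unknown $\sigma^2$ (where the maximum of (5) will typically be $+\infty$, such that we have to restrict ourselves to a local maximum w.r.t. $\sigma^2 > 0$). At this point, it is convenient to introduce the following statistics:

$$\hat{\mu} = \bar{d}_{\cdot,\cdot}, \qquad \hat{\alpha}_i = \bar{d}_{i,\cdot} - \bar{d}_{\cdot,\cdot}, \qquad \hat{\beta}_j = \bar{d}_{\cdot,j} - \bar{d}_{\cdot,\cdot}, \qquad \hat{\gamma}_{p,q} = \bar{d}_{R_p,C_q} - \bar{d}_{R_p,\cdot} - \bar{d}_{\cdot,C_q} + \bar{d}_{\cdot,\cdot}$$

for $i = 1, \ldots, I$, $j = 1, \ldots, J$, $p = 1, \ldots, P$ and $q = 1, \ldots, Q$.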
Proposition 1. For a given two-mode partitioning $\mathcal{R} \times \mathcal{C}$, the m.l. estimates of $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{p,q}$ under identification constraints (4a-4d) are given by $\hat{\mu}$, $\hat{\alpha}_i$, $\hat{\beta}_j$, and $\hat{\gamma}_{p,q}$, respectively, where $\hat{\gamma}_{p,q}$ equals the block interaction term $g_{p,q}$ from (1).
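Proof. Whereas this result could be derived from sufficiency concepts for exponential distribution families, we present here an elementary algebraic proof. After inserting $\hat{\mu}$, $\hat{\alpha}_i$, $\hat{\beta}_j$, and $\hat{\gamma}_{p,q}$ (which are to be treated as constants), and writing $e_{i,j} = d_{i,j} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j - \hat{\gamma}_{p,q}$ for the deviations, the squared-error residual sum $S$ can be decomposed as follows:

$$S = \sum_{p,q} \sum_{i \in R_p} \sum_{j \in C_q} e_{i,j}^2 + \sum_{p,q} \sum_{i \in R_p} \sum_{j \in C_q} \left[ (\hat{\mu} - \mu)^2 + (\hat{\alpha}_i - \alpha_i)^2 + (\hat{\beta}_j - \beta_j)^2 + (\hat{\gamma}_{p,q} - \gamma_{p,q})^2 \right] + U$$

where $U$ is a sum of cross-product terms that equals 0 (see below). Since the second sum is always non-negative, $S$ is minimized if and only if $\mu = \hat{\mu}$, $\alpha_i = \hat{\alpha}_i$, $\beta_j = \hat{\beta}_j$, and $\gamma_{p,q} = \hat{\gamma}_{p,q}$ as asserted. For $U$ we have

$$U = 2 \sum_{p,q} \sum_{i \in R_p} \sum_{j \in C_q} \left( A_1 + A_2 + A_3 + A_4 + B_1 + B_2 + B_3 + C_1 + C_2 + C_3 \right)$$

with

$$A_1 = e_{i,j} (\hat{\mu} - \mu), \quad A_2 = e_{i,j} (\hat{\alpha}_i - \alpha_i), \quad A_3 = e_{i,j} (\hat{\beta}_j - \beta_j), \quad A_4 = e_{i,j} (\hat{\gamma}_{p,q} - \gamma_{p,q}),$$
$$B_1 = (\hat{\mu} - \mu)(\hat{\alpha}_i - \alpha_i), \quad B_2 = (\hat{\mu} - \mu)(\hat{\beta}_j - \beta_j), \quad B_3 = (\hat{\mu} - \mu)(\hat{\gamma}_{p,q} - \gamma_{p,q}),$$
$$C_1 = (\hat{\alpha}_i - \alpha_i)(\hat{\beta}_j - \beta_j), \quad C_2 = (\hat{\alpha}_i - \alpha_i)(\hat{\gamma}_{p,q} - \gamma_{p,q}), \quad C_3 = (\hat{\beta}_j - \beta_j)(\hat{\gamma}_{p,q} - \gamma_{p,q}),$$

and we will show that $U = 0$. In fact, the sum of the deviations $e_{i,j}$ over each block $R_p \times C_q$ is 0. Therefore, in $U$ the partial sums related to $A_1$ and $A_4$ are 0 as well. Because of identification constraint (4a) and the fact that by definition $\bar{\hat{\alpha}}_{\cdot} = 0$, the partial sum related to $B_1$ is 0 as well. The same holds for the partial sums related to $B_2$ and $B_3$, respectively, after considering the identification constraints (4b) and (4c-4d), and the definitions of $\hat{\beta}_j$ and $\hat{\gamma}_{p,q}$. The sum related to $C_1$ is 0 because of identification constraint (4b) and the fact that by definition $\bar{\hat{\beta}}_{\cdot} = 0$. The sum related to $C_2$ is 0 because of identification constraint (4c) and the fact that by definition $\bar{\hat{\gamma}}_{p,\cdot} = 0$. Similarly, the sum related to $C_3$ vanishes because of identification constraint (4d) and the fact that by definition $\bar{\hat{\gamma}}_{\cdot,q} = 0$. Finally, the sum related to $A_2$ is, considering the fact that by definition $\bar{\hat{\beta}}_{\cdot} = 0$ and $\bar{\hat{\gamma}}_{p,\cdot} = 0$, given by

$$\sum_{i=1}^{I} (\hat{\alpha}_i - \alpha_i) \sum_{j=1}^{J} e_{i,j} = \sum_{i=1}^{I} (\hat{\alpha}_i - \alpha_i) \, J \, (\bar{d}_{i,\cdot} - \hat{\mu} - \hat{\alpha}_i) = 0.$$

In a similar way we show that the sum related to $A_3$ is 0 as well. So $U$ is a sum of zero sums and equals 0.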
It remains to consider optimization with respect to the two-mode partitioning R × C.

Proposition 2. Maximizing likelihood criterion (5), or minimizing quadratic criterion (6), is equivalent to maximizing interaction criterion (2).
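Proof. Substituting the m.l. estimates of $\mu$, $\alpha_i$, $\beta_j$ and $\gamma_{p,q}$ into equation (6) for $S$, we obtain the following quadratic criterion, which is to be minimized with respect to $\mathcal{R}$ and $\mathcal{C}$:

$$S = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{i \in R_p} \sum_{j \in C_q} (d_{i,j} - \bar{d}_{i,\cdot} - \bar{d}_{\cdot,j} + \bar{d}_{\cdot,\cdot} - \hat{\gamma}_{p,q})^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} (d_{i,j} - \bar{d}_{i,\cdot} - \bar{d}_{\cdot,j} + \bar{d}_{\cdot,\cdot})^2 - \sum_{p=1}^{P} \sum_{q=1}^{Q} \#R_p \, \#C_q \, \hat{\gamma}_{p,q}^2.$$

The first term on the right-hand side does not depend on the partitioning. Since $\hat{\gamma}_{p,q} = g_{p,q}$ as stated by Proposition 1, minimizing $S$ (i.e., maximizing likelihood function (5)) over all $\mathcal{R} \times \mathcal{C}$ is equivalent to maximizing interaction criterion (2).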
Remark 1: The previous model, formulas, and results apply also to the case when the data $d_{i,j}$ are multi-dimensional with values in $\mathbb{R}^k$, say, with i.i.

An ANOVA Model with Clustered Main Effects
Model (3) comprises $I + J$ individual main effects and $P \cdot Q$ block-specific interaction effects. An even more parsimonious model would read as follows:

$$d_{i,j} = \mu + \alpha_p + \beta_q + \gamma_{p,q} + \varepsilon_{i,j} = \mu_{p,q} + \varepsilon_{i,j} \tag{10}$$

(for $i \in R_p$, $j \in C_q$, $p = 1, \ldots, P$, $q = 1, \ldots, Q$), where $\alpha_p$ and $\beta_q$ represent $P + Q$ cluster-specific main effects (with suitable identification constraints) of row cluster $R_p$ and column cluster $C_q$, respectively (instead of the $I + J$ main effects in the former model (3)). For this model, maximizing the classification likelihood comes down to minimizing the quadratic criterion

$$\sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{i \in R_p} \sum_{j \in C_q} (d_{i,j} - \mu_{p,q})^2$$

with respect to $\mathcal{R}$, $\mathcal{C}$ and the block centroid matrix $M = (\mu_{p,q})_{P \times Q}$. Since, for a given two-mode partitioning $\mathcal{R} \times \mathcal{C}$, partial optimization w.r.t. $M$ yields the block means (m.l. estimates) $\hat{\mu}_{p,q} = \bar{d}_{R_p,C_q}$ for $p = 1, \ldots, P$, $q = 1, \ldots, Q$, this biclustering problem implies minimizing the criterion

$$\sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{i \in R_p} \sum_{j \in C_q} (d_{i,j} - \bar{d}_{R_p,C_q})^2 \to \min_{\mathcal{R}, \mathcal{C}} \tag{11}$$

with respect to $\mathcal{R}$ and $\mathcal{C}$, which is the classical least-squares two-mode partitioning problem (Van Mechelen, Bock, and De Boeck, 2004, pp. 373-374), sometimes referred to as double k-means.
Comparison of models (3) and (10) hence clarifies that maximizing interaction criterion (2) means concentrating on the row by column interaction only (while main effects may be row- and column-specific and thus without any clustering structure), whereas minimizing the classical least-squares two-mode partitioning criterion (11) tacitly assumes a clustering structure for the main effects as well, with the same clusters as for the row by column interaction.
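The contrast drawn above can be made concrete with a small sketch. The code below evaluates the classical least-squares criterion, assumed here to be the sum of squared deviations of each cell from its block mean, on raw toy data in which two rows carry a large main effect. The data, the function name, and the two candidate row partitions are hypothetical and chosen only for illustration.

```python
def within_block_ss(D, row_labels, col_labels, P, Q):
    """Sum of squared deviations of each cell from the mean of its block."""
    sums = [[0.0] * Q for _ in range(P)]
    counts = [[0] * Q for _ in range(P)]
    for i, row in enumerate(D):
        for j, value in enumerate(row):
            sums[row_labels[i]][col_labels[j]] += value
            counts[row_labels[i]][col_labels[j]] += 1
    total = 0.0
    for i, row in enumerate(D):
        for j, value in enumerate(row):
            p, q = row_labels[i], col_labels[j]
            total += (value - sums[p][q] / counts[p][q]) ** 2
    return total


# Rows 0 and 1 share one interaction pattern, rows 2 and 3 the opposite one;
# rows 1 and 2 additionally carry a row main effect of +10.
D = [[1, 0], [11, 10], [10, 11], [0, 1]]
cols = [0, 1]
by_interaction = [0, 0, 1, 1]   # groups rows with the same interaction pattern
by_main_effect = [0, 1, 1, 0]   # groups rows with the same overall level

ss_interaction = within_block_ss(D, by_interaction, cols, 2, 2)
ss_main_effect = within_block_ss(D, by_main_effect, cols, 2, 2)
```

On the raw data the criterion is far smaller for the partition that follows the row main effects (2.0 versus 200.0), so a least-squares two-mode partitioning of the raw matrix would prefer that partition even though it mixes the two interaction patterns.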

Equivalence of Maximal Interaction Two-Mode Clustering and Least-Squares Two-Mode Partitioning of Double-Centered Data
In this section, it will first be shown (Section 4.1) that any two-mode partitioning $\mathcal{R} \times \mathcal{C}$ that maximizes interaction criterion (2) minimizes the classical least-squares two-mode (double k-means) partitioning criterion (11) when applied to the data matrix after double-centering and vice versa. Second (Section 4.2), an important consequence of the equivalence relation proven in this section will be discussed, namely, that no new numerical optimization algorithms have to be developed for maximal interaction two-mode clustering.

Proof of Equivalence
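Let $D^* = (d^*_{i,j})$ denote the double-centered data matrix where

$$d^*_{i,j} = d_{i,j} - \bar{d}_{i,\cdot} - \bar{d}_{\cdot,j} + \bar{d}_{\cdot,\cdot} \tag{12}$$

is the individual deviation from additivity for $(i, j)$.
Proposition 3. Maximizing interaction criterion (2) w.r.t. the two-mode partitioning $\mathcal{R} \times \mathcal{C}$ is equivalent to minimizing the classical least-squares two-mode partitioning criterion (11) for the double-centered data matrix $D^*$:

$$\sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{i \in R_p} \sum_{j \in C_q} (d^*_{i,j} - \bar{d}^*_{R_p,C_q})^2 \to \min_{\mathcal{R}, \mathcal{C}}. \tag{13}$$

Remark 2: Formulation (13) can be interpreted in the way that the maximal interaction criterion (2) looks for a two-mode partitioning $\mathcal{R} \times \mathcal{C}$ such that the individual deviations from additivity $d^*_{i,j}$ are, on the average, as homogeneous as possible within the blocks $R_p \times C_q$.
Proof. For any two-mode partitioning $\mathcal{R} \times \mathcal{C}$, the total sum of squares $T$ in the double-centered data matrix $D^*$ can be decomposed into a within-block and a between-block SSQ. Using the fact that by definition $\bar{d}^*_{\cdot,\cdot} = 0$, we obtain:

$$T = \sum_{i=1}^{I} \sum_{j=1}^{J} (d^*_{i,j})^2 = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{i \in R_p} \sum_{j \in C_q} (d^*_{i,j} - \bar{d}^*_{R_p,C_q})^2 + \sum_{p=1}^{P} \sum_{q=1}^{Q} \#R_p \, \#C_q \, (\bar{d}^*_{R_p,C_q})^2. \tag{14}$$

Since $T$ does not depend on $\mathcal{R}$ and $\mathcal{C}$, it appears that any two-mode partitioning $\mathcal{R} \times \mathcal{C}$ that minimizes the first term at the right-hand side of (14) also maximizes the second term in that expression, and vice versa. Obviously, the first term at the right-hand side of (14) is the classical least-squares two-mode partitioning criterion (11) when applied to the double-centered data. The proof will now be completed by showing that the second term at the right-hand side of (14) is identical to the interaction criterion (2). This is obvious if we can show that $\bar{d}^*_{R_p,C_q}$ equals the block-specific interaction value $g_{p,q}$ from (1). In fact, making use of the definitions of $d^*_{i,j}$ and of the block, row cluster, and column cluster means, averaging (12) over all cells of block $R_p \times C_q$ yields

$$\bar{d}^*_{R_p,C_q} = \bar{d}_{R_p,C_q} - \bar{d}_{R_p,\cdot} - \bar{d}_{\cdot,C_q} + \bar{d}_{\cdot,\cdot} = g_{p,q}.$$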

Important Algorithmic Consequence
Proposition 3 implies that any numerical optimization algorithm designed for classical least-squares two-mode partitioning, that is, designed for minimizing criterion (11), can also be used to maximize interaction criterion (2), just by replacing the original data matrix $D$ in (11) by its double-centered version $D^*$. Fortunately, many algorithms for minimizing (11) have been proposed and evaluated. A selection of work in this area may be found in Gaul and Schader (1996); Baier, Gaul, and Schader (1997); Hansohm (2001); Vichi (2001); Castillo and Trejos (2002); Rocci and Vichi (2008) and Van Rosmalen et al. (2009). Therefore, there is no need to propose and evaluate novel optimization algorithms designed specifically for maximizing interaction criterion (2).
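The equivalence can also be checked numerically. The following sketch double-centers a small, arbitrarily chosen matrix, splits the total sum of squares of the centered matrix into its within-block and between-block parts, and compares the between-block part with the interaction criterion computed from the raw data. The interaction criterion is assumed here to be the size-weighted sum of squared block interaction terms; all function names and the example data are illustrative, and all clusters are assumed to be non-empty.

```python
def double_center(D):
    """Subtract row and column means and add back the overall mean."""
    I, J = len(D), len(D[0])
    row_mean = [sum(row) / J for row in D]
    col_mean = [sum(D[i][j] for i in range(I)) / I for j in range(J)]
    grand = sum(row_mean) / I
    return [[D[i][j] - row_mean[i] - col_mean[j] + grand for j in range(J)]
            for i in range(I)]


def block_means(X, rows, cols, P, Q):
    """Return (means, counts) per block; assumes no block is empty."""
    sums = [[0.0] * Q for _ in range(P)]
    n = [[0] * Q for _ in range(P)]
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            sums[rows[i]][cols[j]] += x
            n[rows[i]][cols[j]] += 1
    return [[sums[p][q] / n[p][q] for q in range(Q)] for p in range(P)], n


def decomposition(D, rows, cols, P, Q):
    """Return (T, within, between) for the double-centered version of D."""
    X = double_center(D)
    m, n = block_means(X, rows, cols, P, Q)
    total = sum(x * x for row in X for x in row)
    within = sum((x - m[rows[i]][cols[j]]) ** 2
                 for i, row in enumerate(X) for j, x in enumerate(row))
    between = sum(n[p][q] * m[p][q] ** 2 for p in range(P) for q in range(Q))
    return total, within, between


def interaction_criterion(D, rows, cols, P, Q):
    """Size-weighted sum of squared block interaction terms, from raw data."""
    I, J = len(D), len(D[0])
    m, n = block_means(D, rows, cols, P, Q)
    n_r = [sum(n[p]) // J for p in range(P)]
    n_c = [sum(n[p][q] for p in range(P)) // I for q in range(Q)]
    r_mean = [sum(n_c[q] * m[p][q] for q in range(Q)) / J for p in range(P)]
    c_mean = [sum(n_r[p] * m[p][q] for p in range(P)) / I for q in range(Q)]
    grand = sum(n_r[p] * r_mean[p] for p in range(P)) / I
    return sum(n[p][q] * (m[p][q] - r_mean[p] - c_mean[q] + grand) ** 2
               for p in range(P) for q in range(Q))


D = [[3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]]
rows, cols = [0, 0, 1, 1], [0, 1, 1]
T, within, between = decomposition(D, rows, cols, 2, 2)
G = interaction_criterion(D, rows, cols, 2, 2)
```

Up to floating-point error, `T` equals `within + between` and `between` equals `G`, so lowering the within-block term on the centered data raises the interaction criterion by the same amount.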

Application to Altruism Data
In this section, maximal interaction two-mode clustering is illustrated using data from the domain of contextualized personality psychology (Mischel and Shoda, 1995, 1998; Shoda et al., 2013, 2015), which aims at characterizing individual differences in behavior profiles across situations. An important challenge in this regard is to capture the gist of the person by situation interaction contained in behavioral data. Studying such interactions may reveal the underlying mechanisms through which the behavior under study comes about.

Maximal Interaction Results for Altruism Data
The data of our application stem from a study on altruism (Quintiens, 1999). The key question that goes with these data is to retrieve the mechanisms underlying individual differences in helping behavior. A group of $I = 102$ persons was presented with a set of $J = 16$ vignettes, each of which described in a few sentences an emergency situation that typically occurs in the everyday life of students, with a victim that could possibly be helped by the participant. Two (abbreviated) examples of such situation descriptions are: 'In a very crowded grocery store you see a little boy, weeping and crying for his mum', and 'Your neighbors ask you to care for their pets while they are abroad during the summer holidays and in return allow you to make use of their swimming pool'. The persons were asked to rate each situation with respect to the extent to which they would be willing to help the victim in it. For this purpose they had to use a 7-point scale from 0 (definitely not) through 6 (definitely yes).
To capture the dominant interaction pattern in the person by situation willingness to help data matrix $D = (d_{i,j})$, we used the maximal interaction two-mode clustering approach from Section 2.2 by minimizing the classical least-squares two-mode partitioning criterion (11) for the double-centered matrix $D^*$ with an algorithm implemented in free and user-friendly software called TwoMP (Schepers and Hofmans, 2009). Given pre-specified values $P$ and $Q$, this algorithm starts from some initial two-mode partitioning $\mathcal{R}_0 \times \mathcal{C}_0$ and proceeds by an alternating least-squares optimization approach in which the classifications of the row and column sets are updated in turn. Each update implies that, consecutively, rows (resp. columns) are optimally (re)assigned to one of the row (resp. column) clusters. At each evaluation of a candidate assignment, a corresponding update of the centroid matrix $M = (\mu_{p,q}) = (\bar{d}_{R_p,C_q})$ is computed. The alternating steps of updating the classifications of the row and column sets, respectively, are continued until the value of criterion (11) no longer decreases (or, equivalently, criterion (2) no longer increases). This algorithm is guaranteed to find a locally optimal solution which, as is well known, is not necessarily the global optimum. Therefore, TwoMP allows the user to specify a desired number of different runs of the algorithm, each of which is initialized by an independently generated random start, and, among the multiple estimated solutions, selects the one for which (11) takes a minimal value. One may note that this alternating least-squares algorithm is a special case of the so-called DRIFT algorithm, which can also handle three-mode partitioning and which has been tested extensively with regard to algorithmic performance (Schepers, Van Mechelen, and Ceulemans, 2006).
The person by situation willingness to help data matrix was clustered for all combinations of numbers of clusters $P = 2, \ldots, 6$ and $Q = 2, \ldots, 6$, and with 100 random starts for each such combination. In the framework of two-mode partitioning problems, various procedures for choosing the appropriate numbers of clusters $P$ and $Q$ have been proposed in the literature (see, e.g., Ceulemans and Kiers, 2006; Schepers, Ceulemans, and Van Mechelen, 2008; and Wilderjans, Ceulemans, and Meers, 2013). Instead of presenting a full analysis of our data, we refer, for illustration purposes, only to the optimal biclusterings with $P + Q \leq 5$. For this subset of solutions, the best one (i.e., with maximal criterion value (2)) comprises $P = 3$ person clusters and $Q = 2$ situation clusters. The maximized value of interaction criterion (2) equals 223.9 for this solution. As the individual person by situation interaction sum of squares (14) equals $T = 1941.9$, this implies that by applying maximal interaction clustering 11.53% of this sum of squares is captured by the optimal biclustering with $P = 3$ person clusters and $Q = 2$ situation clusters.
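The alternating scheme described above can be sketched in a few lines of pure Python. This is not the TwoMP implementation: it is a simplified variant, written under the assumption that the input matrix has already been double-centered, in which all rows (resp. columns) are reassigned against a fixed centroid matrix and the centroids are recomputed only afterwards, rather than at each candidate assignment. The function names, the stopping tolerance, and the toy data are illustrative assumptions.

```python
import random


def block_means(X, rows, cols, P, Q):
    """Block centroid matrix M; empty blocks get 0.0 (the mean of centered data)."""
    sums = [[0.0] * Q for _ in range(P)]
    counts = [[0] * Q for _ in range(P)]
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            sums[rows[i]][cols[j]] += x
            counts[rows[i]][cols[j]] += 1
    return [[sums[p][q] / counts[p][q] if counts[p][q] else 0.0
             for q in range(Q)] for p in range(P)]


def loss(X, rows, cols, M):
    """Least-squares two-mode partitioning criterion for given partitions."""
    return sum((x - M[rows[i]][cols[j]]) ** 2
               for i, row in enumerate(X) for j, x in enumerate(row))


def als_run(X, rows, cols, P, Q, max_iter=100):
    """One alternating least-squares run from the start partition (rows, cols)."""
    I, J = len(X), len(X[0])
    previous = float("inf")
    for _ in range(max_iter):
        M = block_means(X, rows, cols, P, Q)
        # Reassign every row to the row cluster with the smallest squared error.
        rows = [min(range(P), key=lambda p: sum((X[i][j] - M[p][cols[j]]) ** 2
                                                 for j in range(J)))
                for i in range(I)]
        M = block_means(X, rows, cols, P, Q)
        # Reassign every column in the same way.
        cols = [min(range(Q), key=lambda q: sum((X[i][j] - M[rows[i]][q]) ** 2
                                                 for i in range(I)))
                for j in range(J)]
        current = loss(X, rows, cols, block_means(X, rows, cols, P, Q))
        if current >= previous - 1e-12:   # no further decrease: stop
            break
        previous = current
    return current, rows, cols


def best_of_random_starts(X, P, Q, n_starts=100, seed=0):
    """Run als_run from several random partitions and keep the best solution."""
    rng = random.Random(seed)
    I, J = len(X), len(X[0])
    runs = (als_run(X, [rng.randrange(P) for _ in range(I)],
                    [rng.randrange(Q) for _ in range(J)], P, Q)
            for _ in range(n_starts))
    return min(runs, key=lambda run: run[0])


# Toy double-centered matrix with a perfect 2 x 2 block structure.
X = [[1, 1, -1, -1], [1, 1, -1, -1], [-1, -1, 1, 1], [-1, -1, 1, 1]]
result = als_run(X, [0, 0, 0, 1], [0, 0, 0, 1], 2, 2)
best = best_of_random_starts(X, 2, 2, n_starts=20)
```

Starting from the slightly perturbed partition, the single run recovers the block structure with a criterion value of zero. Since a single run may end in a local optimum, `best_of_random_starts` mirrors the multistart strategy described above.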

Comparison with Results from Double K-Means and Two-Mode Mixture Analysis
We also analyzed the person by situation altruism data matrix $D = (d_{i,j})$ by two other two-mode clustering methods that are widely used in the social sciences, double k-means (Vichi, 2001) and two-mode Gaussian mixture analysis (Govaert and Nadif, 2013), in order to compare the resulting solutions with the one we obtained using maximal interaction clustering.
The double k-means solution was obtained by minimizing the classical least-squares criterion (11) with the software TwoMP, whereas the two-mode Gaussian mixture solution was obtained using the R package blockcluster (Iovleff and Singh Bhatia, 2015). In order to maintain comparability with the results of maximal interaction clustering, these two solutions were likewise obtained for $P = 3$ person clusters and $Q = 2$ situation clusters, making use of 100 different starts. Furthermore, for the Gaussian mixture solution, person and situation cluster labels were obtained by assigning each person (resp. situation) to the person (resp. situation) cluster for which its posterior cluster membership probability was the largest.
For the double k-means solution the value of interaction criterion (2) turned out to equal 139.32 (explaining 7.2% of $T$) and for the two-mode mixture solution the value of this criterion was 129.67 (explaining 6.7% of $T$). Hence, both double k-means and two-mode mixture analysis capture some of the person by situation interaction sum of squares. However, both methods perform worse in this regard than maximal interaction clustering (which explained 11.53% of $T$).
To measure the agreement between the different biclusterings obtained for the altruism data, we also calculated the Hubert and Arabie adjusted Rand indices (ARI: Hubert and Arabie, 1985) between the person (resp. situation) clusterings as obtained by each pair of methods. The situation clusterings as obtained from double k-means and two-mode Gaussian mixture analysis are identical (ARI = 1.00), but the situation clustering as obtained from the maximal interaction biclustering is different from these (ARI = .52). Similarly, the person clusterings as obtained from double k-means and two-mode Gaussian mixture analysis are more similar to each other (ARI = .53) than to the person clustering as obtained from the maximal interaction method (ARI = .18 and ARI = .13). Note that, based on an extensive simulation study, Steinley (2004) concluded that values of ARI below .65 reflect poor agreement. For the altruism data this then implies that the maximal interaction clustering method yielded a biclustering that is substantively very different from the solutions yielded by double k-means and two-mode Gaussian mixture analysis, respectively.
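The percentages quoted above follow from dividing each criterion value by $T = 1941.9$. As a small arithmetic check (the dictionary keys are illustrative labels only):

```python
T = 1941.9  # person by situation interaction sum of squares

criterion_values = {
    "maximal interaction": 223.9,
    "double k-means": 139.32,
    "two-mode mixture": 129.67,
}

# Share of the interaction sum of squares captured by each solution, in percent.
shares = {name: 100 * value / T for name, value in criterion_values.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}% of T")
```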
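For readers who wish to reproduce this kind of comparison, the adjusted Rand index of Hubert and Arabie (1985) can be computed from two label vectors as sketched below. The function name and the toy label vectors are illustrative; the altruism cluster labels themselves are not reproduced here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Hubert-Arabie adjusted Rand index between two partitions given as labels."""
    n = len(a)
    # Pairs of objects placed together in both partitions, in a, and in b.
    together_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    together_a = sum(comb(c, 2) for c in Counter(a).values())
    together_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:   # degenerate case, e.g. both partitions trivial
        return 1.0
    return (together_both - expected) / (maximum - expected)


same_up_to_relabeling = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to relabeling of the clusters, equals 1 for identical partitions, and can become negative when agreement is lower than expected by chance, as in the second toy call.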

Substantive Interpretation of Maximal Interaction Biclusters
In this section, we will discuss in detail some of the substantive implications and interpretations of the obtained maximal interaction biclustering solution. Table 1 and Figure 1 show the interaction terms ($g_{p,q}$) of all $2 \cdot 3 = 6$ pairs of person and situation clusters (block clusters). It appears that the largest block-specific interaction terms (i.e., deviations from additivity) are observed for person clusters $PC_2$ (20 persons) and $PC_3$ (34 persons) in situation cluster $SC_2$ (which includes 3 situations). Moreover, since the interaction terms associated with person cluster $PC_1$ are close to zero, helping behavior across situations appears to be described accurately by an additive effect of the persons and situations for all persons within person cluster $PC_1$ (48 persons).
To obtain a substantive psychological interpretation for the obtained row (person) and column (situation) clusters and to highlight the relevance of the obtained biclustering, we have also performed additional analyses. In a first additional analysis, we have compared the situation clusters with an external rating from expert judges of the extent to which each situation $j$ describes an equivocal event, that is, an event that can be interpreted in different ways. Formally, we have introduced the indicator variable $V$ for membership of each situation $j$ in column (situation) cluster $SC_2$ such that $V_j = 1$ (0) if $j \in SC_2$ ($j \in SC_1$). It appeared that $V$ had a relatively high correlation with the expert rating ($r = 0.46$) such that we may conclude that column cluster $SC_2$ comprises more or less ambiguous situations.
In a second additional analysis, we looked for a substantive interpretation of the person clusters by analyzing their relationship to 16 external dispositional variables $Z_l$ ($l = 1, ..., 16$) that measure the general feelings and attitudes towards helping behavior for the persons $i = 1, ..., 102$ and that were recorded in the study. The row (person) clusters were described by the person cluster membership variable $W$ such that for a person $i$ we set $W_i = 1$ (resp. 2, 3) if $i$ is in cluster $PC_1$ (resp. $PC_2$, $PC_3$). In this framework, we studied if and how the variable $W$ relates to the external variables $Z_l$ by considering a multinomial logistic regression model with $W$ as dependent variable and the $Z_l$ as predictor variables (where $W_i = 1$, i.e., membership in person cluster $PC_1$, was chosen as reference category). A forward selection strategy identified two of the 16 dispositional variables as significant ($p < .05$) predictors of $W$, namely $Z_1$ (i.e., the extent to which one feels satisfied when being helped by others) and $Z_2$ (i.e., the extent to which one feels capable to empathize with others). The classification accuracy of predicting $W$ using these two predictor variables amounts to 54%, which is significantly more than can be expected by chance. The estimated regression coefficients $\hat{\theta}_{cl}$ for this multinomial regression are presented in Table 2.
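For orientation, a multinomial logistic regression with $W_i = 1$ as reference category is commonly written in baseline-category logit form. The version below is a sketch under that assumption: the intercepts $\theta_{c0}$, the symbol $z_{i,l}$ for the score of person $i$ on $Z_l$, and the sum over all 16 candidate predictors are illustrative notation and are not taken from the article.

```latex
\log \frac{P(W_i = c)}{P(W_i = 1)}
  = \theta_{c0} + \sum_{l=1}^{16} \theta_{cl}\, z_{i,l},
  \qquad c = 2, 3 .
```

Under this form, a positive $\theta_{cl}$ means that higher scores on $Z_l$ raise the odds of membership in $PC_c$ relative to $PC_1$, and a negative $\theta_{cl}$ lowers them.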
Combining the results of the analyses discussed above, it appears from Table 2 and Figure 1 that a smaller amount of satisfaction when helped by others (which holds for $PC_2$ persons, that is, persons for which $W_i = 2$) leads to less helping behavior in more ambiguous situations (i.e., the situations of $SC_2$) than can be expected from an additive effect of the persons and situations. In contrast, more satisfaction when helped by others combined with feeling less capable of empathizing with others (which holds for $PC_3$ persons, that is, persons for which $W_i = 3$) leads to more helping behavior in ambiguous situations than can be expected on the basis of an additive effect of the persons and situations. At first sight, the lower self-reported level of empathy of $PC_3$ persons may seem counterintuitive. Yet, it can be explained in that in more ambiguous situations, the $PC_3$ persons, who feel the least capable of empathizing with others, will let their intention to help be driven mainly by the fact that they themselves would feel highly satisfied when helped by others.

Table 2. Estimated regression coefficients $\hat{\theta}_{cl}$ for multinomial regression of person cluster membership $W$ on $Z_1$ (i.e., the extent to which one feels satisfied when being helped by others) and $Z_2$ (i.e., the extent to which one feels capable to empathize with others).

Group discrimination           Z_1     Z_2
W_i = 2 versus W_i = 1        -.17     .06
W_i = 3 versus W_i = 1         .19    -.30

Relation to Other Biclustering Methods Involving Interaction Concepts
After the paper by Bock (1980), various biclustering methods have been proposed that are based on interaction concepts. However, when comparing these methods it should be kept in mind that the approaches may refer to at least three different types of deviations from additivity, i.e.:
- block-specific deviations from additivity $g_{p,q} = \bar{d}_{R_p,C_q} - \bar{d}_{R_p,\bullet} - \bar{d}_{\bullet,C_q} + \bar{d}_{\bullet,\bullet}$ from (1), used in the maximum interaction approach,
- individual overall deviations from additivity $d^*_{i,j} = d_{i,j} - \bar{d}_{i,\bullet} - \bar{d}_{\bullet,j} + \bar{d}_{\bullet,\bullet}$ from (12),
- and individual bicluster-specific deviations from additivity $s^{(p,q)}_{i,j} = d_{i,j} - \bar{d}_{i,C_q} - \bar{d}_{R_p,j} + \bar{d}_{R_p,C_q}$ for $i \in R_p$, $j \in C_q$. (15)
Each of these may be useful for handling specific research questions.
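The three kinds of deviations from additivity can be told apart most easily by computing them side by side. The NumPy sketch below does so under stated assumptions: cluster memberships are given as integer label vectors 0, 1, ... with no empty cluster, and the function names and the small matrix are hypothetical illustrations rather than part of the original analysis.

```python
import numpy as np


def double_center(A):
    """Individual overall deviations: subtract row and column means, add the grand mean."""
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()


def block_interactions(D, row_labels, col_labels):
    """g[p, q] = block mean - row-cluster mean - column-cluster mean + grand mean."""
    n_row_clusters = row_labels.max() + 1
    n_col_clusters = col_labels.max() + 1
    g = np.empty((n_row_clusters, n_col_clusters))
    for p in range(n_row_clusters):
        for q in range(n_col_clusters):
            rows, cols = row_labels == p, col_labels == q
            g[p, q] = (D[np.ix_(rows, cols)].mean() - D[rows, :].mean()
                       - D[:, cols].mean() + D.mean())
    return g


def bicluster_deviations(D, row_labels, col_labels):
    """s[i, j]: double-centering carried out separately inside every block."""
    S = np.empty_like(D)
    for p in np.unique(row_labels):
        for q in np.unique(col_labels):
            block = np.ix_(row_labels == p, col_labels == q)
            S[block] = double_center(D[block])
    return S


# Hypothetical 4 x 4 matrix with constant blocks (values chosen for illustration only).
D = np.array([[5.0, 5.0, 1.0, 1.0],
              [5.0, 5.0, 1.0, 1.0],
              [1.0, 1.0, 5.0, 5.0],
              [1.0, 1.0, 5.0, 5.0]])
row_labels = np.array([0, 0, 1, 1])
col_labels = np.array([0, 0, 1, 1])

g = block_interactions(D, row_labels, col_labels)    # block-specific deviations
d_star = double_center(D)                            # individual overall deviations
s = bicluster_deviations(D, row_labels, col_labels)  # bicluster-specific deviations
```

In this toy case all interaction sits between the blocks, so `g` is large in absolute value while `s` vanishes everywhere; the two quantities clearly measure different things.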
A method that involves the individual bicluster-specific deviations from additivity (15) was proposed by Cheng and Church (2000) and is now one of the most cited methods in biclustering of microarray data. Specifically, Cheng and Church look for one or a few (possibly overlapping) biclusters $R_p \times C_q$ (with a row cluster $R_p \subset X$ and a column cluster $C_q \subset Y$) of maximal size and that are such that within $R_p \times C_q$ the mean-squared value of the individual bicluster-specific deviations from additivity $s^{(p,q)}_{i,j}$ is smaller than some pre-specified threshold. After such a bicluster has been found, the algorithm replaces the entries of this bicluster by random draws from a uniform distribution and may proceed in the same way to find additional biclusters in a stepwise fashion. In a related method, Cho et al. (2004) look for a two-mode partitioning $R \times C$ with a minimum overall sum $H := \sum_{p,q} \sum_{i \in R_p} \sum_{j \in C_q} |s^{(p,q)}_{i,j}|^2$. A discussion of these and closely related biclustering methods can be found in Madeira and Oliveira (2004) and Tanay, Sharan, and Shamir (2005).
Both approaches differ from our maximal interaction clustering approach in some important respects. First, both approaches look for biclusters with (absolutely) small or minimum sums of squared interaction-type values $s^{(p,q)}_{i,j}$, while our approach looks for bipartitions with a maximum sum of squared block-specific interaction values $g_{p,q}$. Thus, both Cheng and Church and Cho et al. look for biclusters with a negligible within-bicluster interaction, while our approach generates biclusters such that row and column clusters show high between-bicluster interaction. It must be noted that minimizing the within-bicluster sum of squares of the individual bicluster-specific deviations from additivity does not imply that the overall between-bicluster sum of squares of the block-specific deviations from additivity is maximized.
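This can be illustrated by a small example. Consider a $4 \times 4$ data matrix for which the overall individual deviations from additivity sum of squares $T$ is equal to 390.0. Assuming two row clusters and two column clusters, the optimal solution according to the criterion $H$ by Cho et al. includes row clusters $R_1 = \{1, 2\}$ and $R_2 = \{3, 4\}$, and column clusters $C_1 = \{1, 2\}$ and $C_2 = \{3, 4\}$. Note that according to the Cho et al. criterion this is indeed the optimal biclustering as it yields $H = 0$, that is, within each of the four biclusters $R_p \times C_q$ ($p = 1, 2$, $q = 1, 2$) there are no interactions and only row and/or column main effects. However, this very same solution is not optimal according to interaction criterion (2). Specifically, the latter criterion takes a value of 9.0 for this solution. If, however, the column clusters are defined as $C_1 = \{1, 3\}$ and $C_2 = \{2, 4\}$ (while keeping the same row clusters), the value of interaction criterion (2) increases to 380.25, explaining 97.5% of the overall individual deviations from additivity sum of squares $T$.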
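The same phenomenon can be reproduced numerically. The sketch below builds a hypothetical $4 \times 4$ matrix (not the matrix of the example above, so the numbers differ) in which one column partition drives $H$ to zero while another yields a far larger interaction criterion. It assumes that criterion (2) is the size-weighted sum $\sum_{p,q} |R_p|\,|C_q|\, g_{p,q}^2$; the function names are illustrative.

```python
import numpy as np


def double_center(A):
    """Subtract row and column means and add back the grand mean."""
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()


def blocks(row_labels, col_labels):
    """Yield boolean row and column masks for every block of the bipartition."""
    for p in np.unique(row_labels):
        for q in np.unique(col_labels):
            yield row_labels == p, col_labels == q


def cho_H(D, row_labels, col_labels):
    """Sum of squared bicluster-specific deviations from additivity."""
    return sum((double_center(D[np.ix_(r, c)]) ** 2).sum()
               for r, c in blocks(row_labels, col_labels))


def interaction_criterion(D, row_labels, col_labels):
    """Assumed form of criterion (2): sum over blocks of |R_p| |C_q| g[p, q]^2."""
    D_star = double_center(D)
    # The block mean of the double-centered data equals g[p, q].
    return sum(r.sum() * c.sum() * D_star[np.ix_(r, c)].mean() ** 2
               for r, c in blocks(row_labels, col_labels))


# Hypothetical data: main effects plus three mutually orthogonal interaction patterns.
u = np.array([1.0, 1.0, -1.0, -1.0])
v = np.array([1.0, -1.0, 1.0, -1.0])
main_effects = np.add.outer([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0])
D = 10.0 + main_effects + 4.0 * np.outer(u, v) + np.outer(v, u) + np.outer(u, u)

rows = np.array([0, 0, 1, 1])
cols_a = np.array([0, 0, 1, 1])  # columns {1, 2} versus {3, 4}
cols_b = np.array([0, 1, 0, 1])  # columns {1, 3} versus {2, 4}

T = (double_center(D) ** 2).sum()                  # 288
H_a = cho_H(D, rows, cols_a)                       # 0: no within-block interaction
G_a = interaction_criterion(D, rows, cols_a)       # 16
H_b = cho_H(D, rows, cols_b)                       # 16
G_b = interaction_criterion(D, rows, cols_b)       # 256
```

Partition `cols_a` is optimal for $H$ yet captures only 16 of the 288 units of interaction, whereas `cols_b` captures 256 of them at the price of a nonzero $H$.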
Second, in the maximal interaction approach, after having removed the overall main effects of rows and columns, homogeneous biclusters are retained (see Remark 2 in Section 4.1), whereas Cheng and Church as well as Cho et al. consider heterogeneous biclusters with bicluster-specific main effects of rows and columns.
Third, although the biclusterings resulting from the maximal interaction and Cho et al. approaches both capture the total row by column interaction sum of squares, in the maximal interaction clustering approach this interaction is represented by the between-bicluster differences with regard to the bicluster centroids only, whereas in the Cho et al. approach it is represented by the between-bicluster differences with regard to the bicluster centroids plus the between-bicluster differences with regard to the bicluster-specific main effects.
Another interaction-related biclustering method is that of Corsten and Denis (1990), who proposed a method for "identifying simultaneously groups of unstructured rows and groups of unstructured columns (...) such that the interaction between row and column factors is due only to interactions between those groups" (p. 207). This approach is based on an agglomerative hierarchical clustering procedure in each step of which either two rows (or row classes), or two columns (or column classes) are merged into one row or column class, respectively, based on proximity measures between all pairs of possibly merged rows and all pairs of possibly merged columns. This proximity measure is defined by the mean square for interaction in the data subset consisting only of the two rows or the two columns concerned. This method differs from our approach, among other things, in that it is a procedural clustering approach and as such does not involve an overall objective function (Van Mechelen, Bock, and De Boeck, 2004). Specifically, the stepwise approach considers only local interactions (i.e., calculated within the subset of data considered in each step) and therefore is at best indirectly related to the overall interaction criterion (2). To illustrate the difference, we reanalyzed the genotype by location data reported in Corsten and Denis (1990) by maximal interaction clustering (using the same software and algorithmic specifications as discussed in Section 5) and assuming 4 row clusters and 3 column clusters, since these are the numbers of row and column clusters of the solution reported by Corsten and Denis. The value of criterion (2) is equal to 933.6 for the biclustering solution we obtained by means of maximal interaction clustering, whereas it is equal to 555.3 for the solution reported by Corsten and Denis. This is a rather large difference considering the fact that the total row by column interaction sum of squares for this data set equals $T = 2108.4$.
Finally, a major difference between the interaction-related biclustering methods discussed above and maximal interaction biclustering is that the latter can be justified as a classification likelihood estimation of probabilistic model (3), while such a theoretical basis is missing for the other approaches discussed above. Note that a purely additive model of the type $d_{i,j} = \mu + \alpha_p + \beta_q + \varepsilon_{i,j}$ for $i \in R_p$, $j \in C_q$, in analogy to (3) or (10), would lead, via a classification likelihood, to separately clustering the rows and columns of $D$ according to the classical least-squares k-means clustering criterion (see Bock, 1968, pp. 40-43).

Main Effects vs. Interaction
Various two-mode clustering methods are able to capture to some extent row by column interactions. For instance, classical least-squares two-mode partitioning (see Section 3.2), when applied to an arbitrary observed data matrix $D$, will typically yield a block centroid matrix $\hat{M} = (\hat{m}_{p,q})$ with entries $\hat{m}_{p,q}$ that are not equal to a sum of row and column main effects in $\hat{M}$ and that also include interaction effects. However, when for instance the row main effect in the data matrix $D$ is very large, the resulting partition of the rows will largely be such that it captures this main effect as well as possible, and likewise in the case of a large column main effect. Indeed, in applications of classical least-squares two-mode partitioning, it is observed frequently that the obtained solution is mostly dominated by main effects of the row and/or column clustering(s). As discussed in the Introduction section, depending on the substantive-theoretical research question at hand, this may be undesirable. Maximal interaction two-mode clustering (which eliminates the main effects from the very beginning) may then be preferred because it focuses only on the row by column interaction.
From the discussion so far, it may appear that, in the context of biclustering, interest in structuring the row by column interaction is at odds with interest in clustering the row and column main effects since, clearly, a biclustering $R \times C$ of the main effects will not necessarily be identical or even similar to a biclustering $\tilde{R} \times \tilde{C}$ of the interaction only. Interestingly, however, the maximal interaction two-mode clustering approach can be complemented by analyses that exclusively focus on the main effects of the rows and columns of $D$. In particular, such analyses could be done either by applying least-squares one-mode partitioning to the row (resp. column) means of $D$, or by applying classical least-squares two-mode partitioning to the matrix $\tilde{D} = (\tilde{d}_{i,j})$, where $\tilde{d}_{i,j} = d_{i,j} - d^*_{i,j}$ denotes the deviations between the observed data and their double-centered counterparts. From a substantive point of view, such a combined approach may be sensible because it allows for distinct models describing the main effects structure on the one hand and the interaction effect structure on the other hand. This may allow the user to distinguish the underlying mechanisms that drive the main effects from the underlying mechanisms that drive the interaction.
Example: To illustrate the case where the clustering structure of the main effects and the clustering structure of the interaction are not captured by the same bipartition, Table 3 shows a small example of a hypothetical $6 \times 4$ data matrix $D$. The entries of this data matrix were generated according to the formula $d_{i,j} = \mu + \alpha_i + \beta_j + \gamma_{i,j}$ with differently clustered main and interaction effects. Table 3 shows that the row main effect structure is induced by three row clusters $R_1 = \{1, 2\}$, $R_2 = \{3, 4\}$, and $R_3 = \{5, 6\}$ (see the $\alpha_i$), and the column main effect structure by two column clusters $C_1 = \{1, 2\}$ and $C_2 = \{3, 4\}$ (see the $\beta_j$). In contrast, the structure of the interaction terms $\gamma_{i,j}$ is represented by a bipartition with two row clusters $\tilde{R}_1 = \{1, 2, 3\}$ and $\tilde{R}_2 = \{4, 5, 6\}$ and two column clusters $\tilde{C}_1 = \{1, 3\}$ and $\tilde{C}_2 = \{2, 4\}$. In fact, applying the maximal interaction criterion (2) to the data matrix $D$ will reconstruct the latter interaction-based bipartition, while the result from classical least-squares two-mode partitioning using criterion (11) will typically miss this bipartition (and also the one related to the main effects).

Table 3. A $6 \times 4$ data matrix $D = (d_{i,j})$, and corresponding individual row main effects $\alpha_i$, individual column main effects $\beta_j$ and individual row by column interaction terms $\gamma_{i,j}$, such that $d_{i,j} = \mu + \alpha_i + \beta_j + \gamma_{i,j}$ with $\mu = 5$.
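A data matrix with this kind of structure is easy to generate and to split into its main-effect and interaction parts. In the NumPy sketch below only $\mu = 5$ and the cluster memberships follow the description above; the effect sizes in `alpha`, `beta`, and `gamma` are hypothetical values chosen for illustration and are not the entries of Table 3.

```python
import numpy as np

mu = 5.0
# Main effects: constant within row clusters {1,2}, {3,4}, {5,6}
# and within column clusters {1,2}, {3,4} (hypothetical sizes, summing to zero).
alpha = np.array([-2.0, -2.0, 0.0, 0.0, 2.0, 2.0])
beta = np.array([-1.0, -1.0, 1.0, 1.0])

# Interaction: constant within the blocks formed by row clusters {1,2,3}, {4,5,6}
# and column clusters {1,3}, {2,4} (hypothetical size, zero row and column sums).
row_sign = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
col_sign = np.array([1.0, -1.0, 1.0, -1.0])
gamma = 1.5 * np.outer(row_sign, col_sign)

D = mu + alpha[:, None] + beta[None, :] + gamma

# Double-centering isolates the interaction; the remainder holds the main effects.
interaction_part = (D - D.mean(axis=1, keepdims=True)
                    - D.mean(axis=0, keepdims=True) + D.mean())
main_effect_part = D - interaction_part
```

Partitioning `interaction_part` targets the interaction-based bipartition, whereas partitioning `main_effect_part` (or the row and column means of `D`) targets the main-effect clusters, which is the combined strategy outlined in the text.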

Extension to More Than Two Categorical Predictor Variables
Extending maximal interaction two-mode clustering to more than two categorical predictor variables is straightforward, as long as the data have been collected with a fully factorial design (which implies that they can be arranged into an $N$-way $N$-mode array $D$). Perhaps even more so than in the two-mode case, a reduction of the number of elements of each mode is essential in case one wishes to understand the structural information that pertains to the interactions in large multiway data arrays $D$. This could be achieved in a similar way as discussed in this paper for the two-mode case. In particular, in the three-way case one should first triple-center the observed data array $D$ such that $d^*_{i,j,k} = d_{i,j,k} - \bar{d}_{i,\bullet,\bullet} - \bar{d}_{\bullet,j,\bullet} - \bar{d}_{\bullet,\bullet,k} + 2\,\bar{d}_{\bullet,\bullet,\bullet}$ (similarly to the definition of $d^*_{i,j}$ in Subsection 4.1), and subsequently apply least-squares three-mode partitioning to the triple-centered data, making use of standard three-mode partitioning algorithms (Kiers, 2004; Schepers, Van Mechelen, and Ceulemans, 2006). Further extensions to the $N$-way case with $N > 3$ are straightforward.
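The triple-centering step, $d^*_{i,j,k} = d_{i,j,k} - \bar{d}_{i,\bullet,\bullet} - \bar{d}_{\bullet,j,\bullet} - \bar{d}_{\bullet,\bullet,k} + 2\,\bar{d}_{\bullet,\bullet,\bullet}$, can be sketched in a few lines of NumPy. The function name `triple_center` and the random test arrays are assumptions made for illustration; the subsequent three-mode partitioning is not shown.

```python
import numpy as np


def triple_center(D):
    """Remove the three main effects from a three-way array, keeping all interactions."""
    D = np.asarray(D, dtype=float)
    mode_1 = D.mean(axis=(1, 2), keepdims=True)  # means per level of the first mode
    mode_2 = D.mean(axis=(0, 2), keepdims=True)  # means per level of the second mode
    mode_3 = D.mean(axis=(0, 1), keepdims=True)  # means per level of the third mode
    return D - mode_1 - mode_2 - mode_3 + 2.0 * D.mean()


rng = np.random.default_rng(0)

# A purely additive array (hypothetical effects) should be centered to zero.
a, b, c = rng.normal(size=3), rng.normal(size=4), rng.normal(size=2)
additive = 1.0 + a[:, None, None] + b[None, :, None] + c[None, None, :]
residual_additive = triple_center(additive)

# For arbitrary data, every main-effect mean of the centered array vanishes.
residual_general = triple_center(rng.normal(size=(3, 4, 2)))
```

Note that this centering removes only the main effects, so two-way as well as three-way interactions remain in the centered array.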

Conclusion
In this paper, we have shown that the interaction clustering criterion (2), proposed by Bock (1980), which focuses on the row cluster by column cluster interaction, can be justified in terms of a specific probabilistic ANOVA model (3) with individual main effect terms and block-specific interaction terms.In particular, we have shown that maximizing the classification likelihood for this probabilistic model is equivalent to maximizing Bock's interaction criterion (2).This result is useful because it facilitates comparisons of maximal interaction two-mode clustering to other clustering approaches and facilitates understanding the conditions under which the former approach is likely to be successful.
Secondly, we have shown that maximizing Bock's interaction clustering criterion (2) (i.e., maximizing the classification likelihood for model (3)) is equivalent to minimizing the classical least-squares two-mode partitioning criterion (11) when applied to the data matrix after double-centering. This latter result is useful because many good algorithms for classical least-squares two-mode partitioning already exist, such that it is no longer necessary to develop a special one for the maximum interaction criterion (2).
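The equivalence rests on a sum-of-squares decomposition that is easy to check numerically: for any bipartition, the total interaction sum of squares $T$ of the double-centered data splits into the interaction criterion plus the least-squares partitioning loss, so maximizing the former minimizes the latter. The sketch below verifies this on random data, assuming that criterion (2) has the size-weighted form $\sum_{p,q} |R_p|\,|C_q|\, g_{p,q}^2$ and that criterion (11) is the sum of squared distances to the block means; the data and labels are arbitrary illustrations.

```python
import numpy as np


def double_center(A):
    """Subtract row and column means and add back the grand mean."""
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()


rng = np.random.default_rng(1)
D = rng.normal(size=(6, 5))                 # arbitrary data matrix
row_labels = np.array([0, 0, 1, 1, 2, 2])   # arbitrary row partition
col_labels = np.array([0, 1, 0, 1, 1])      # arbitrary column partition

D_star = double_center(D)
T = (D_star ** 2).sum()  # total row by column interaction sum of squares

interaction = 0.0    # assumed form of criterion (2)
least_squares = 0.0  # assumed form of criterion (11), evaluated on D*
for p in np.unique(row_labels):
    for q in np.unique(col_labels):
        block = D_star[np.ix_(row_labels == p, col_labels == q)]
        g_pq = block.mean()  # block mean of D* equals the interaction term g[p, q]
        interaction += block.size * g_pq ** 2
        least_squares += ((block - g_pq) ** 2).sum()
```

Since `T` does not depend on the partition, any algorithm that lowers `least_squares` on the double-centered data raises `interaction` by exactly the same amount.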
d. normal errors $\varepsilon_{i,j} \sim N_k(0, \sigma^2 I_k)$. The only change consists in replacing the absolute values $|\ldots|$ by the Euclidean norm $\|\ldots\|$ and dot products by the inner product in $\mathbb{R}^k$.

Figure 1. Visual display of the block-specific interaction terms $g_{p,q}$ for all pairs of person ($PC_p$) and situation ($SC_q$) clusters.

Table 1. Block-specific interaction terms $g_{p,q}$ for all pairs of person ($PC_p$) and situation ($SC_q$) clusters.