1 Introduction

Multiway data, a generalization of the familiar two-way samples-by-variables data matrix, is becoming more common in a variety of fields. In computer vision applications, images are stored as three-way arrays, or third-order tensors, with rows and columns representing pixel locations and the third way (tubes) representing different color channels (e.g., Krizhevsky et al. 2012). Videos constitute fourth-order tensors, since they capture a sequence of such images over time (e.g., Abu-El-Haija et al. 2016). In the social sciences, a marketing research survey asking multiple individuals to rate several products on various characteristics using a Likert scale generates a three-way array of rating scores (e.g., DeSarbo et al. 1982). Similar research designs are commonly used in sensometrics (e.g., Cariou et al. 2021). Other applications include: high-throughput molecular data in bioinformatics (e.g., Lonsdale et al. 2013); spectroscopic data in chemometrics (e.g., Faber et al. 2003; Bro 2006); and neuroimaging data, collected using electroencephalography (EEG) or functional magnetic resonance imaging (fMRI), for example (e.g., Genevsky and Knutson 2015).

A taxonomy of measurement data is given by Carroll and Arabie (1980), where a mode is defined as “a particular class of entities” associated with a data array, such as stimuli, subjects or scale items; a three-way array may have up to three modes. Kiers (2000) introduced standardized notation and terminology for multiway analysis, while Kroonenberg (2008) is devoted to multiway data analysis methodology.

Performing cluster analysis on one or more of the modes of a three-way data array \(\underline{\mathbf {X}}\) is an important problem in multiway data analysis. Statistical research into three-way clustering can be broadly divided into two streams. The application of finite mixture models for clustering three-way data has been studied in, amongst others, Basford and McLachlan (1985), Hunt and Basford (1999), Vermunt (2007), Viroli (2011), Meulders and De Bruecker (2018), and Gallaugher and McNicholas (2020a, 2020b). Whereas these rely on maximum likelihood estimation, the other research stream has focused on nonparametric data approximation methods, mainly via least-squares estimation. The papers of Vichi (1999), Rocci and Vichi (2005), Vichi et al. (2007), Papalexakis et al. (2013), Wilderjans and Ceulemans (2013), Llobell et al. (2019), Llobell et al. (2020) and Cariou et al. (2021) belong to this stream, as does our paper. An approach often used in this stream is to generalize a three-way data decomposition, such as Tucker’s three-mode factor analysis (Tucker3; Tucker 1966) or the Candecomp/Parafac decomposition (CP; Carroll and Chang 1970; Harshman 1970; Hitchcock 1927), by adding clustering over one or more of the modes.

Here we propose a new method—called least-squares bilinear clustering (or lsbclust)—derived in a similar manner but from a different starting point. In contrast to starting with a three-way decomposition, we start with a bilinear (or biadditive) decomposition of the matrix slices \(\mathbf {X}_i \, (i = 1, \ldots , I)\) of \(\underline{\mathbf {X}}\) for the mode (by convention, the first mode) over which we want to cluster (e.g., Denis and Gower 1994). An example of a bilinear approximation of a matrix \(\mathbf {X}_i\) is

$$\begin{aligned} \mathbf {X}_{i} \approx m \varvec{1} \varvec{1}^{'} + \varvec{a} \varvec{1}^{'} + \varvec{1} \varvec{b}^{'} + \varvec{c} \varvec{d}^{'}, \end{aligned}$$

where \(\varvec{1}\) denotes a column vector of ones. The terms on the right-hand side of (1) can be interpreted as the grand mean (m), row main effects (\(\varvec{a}\)), column main effects (\(\varvec{b}\)) and row–column interactions (\( \varvec{c} \varvec{d}^{'}\)), respectively, and are subject to appropriate identification restrictions.
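To make the decomposition in (1) concrete, the following numpy sketch (all variable names hypothetical) estimates each term by least squares for a single matrix: the grand mean and main effects come from row and column averages, and a rank-one interaction term comes from the leading singular pair of the double-centred residual.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # a single matrix slice X_i (J = 4, K = 3)

m = X.mean()                           # grand mean
a = X.mean(axis=1) - m                 # row main effects (sum to zero)
b = X.mean(axis=0) - m                 # column main effects (sum to zero)
R = X - (m + a[:, None] + b[None, :])  # double-centred residual (interactions)

# Rank-one interaction term c d' from the leading singular pair of R
U, s, Vt = np.linalg.svd(R)
c = U[:, 0] * np.sqrt(s[0])
d = Vt[0] * np.sqrt(s[0])

approx = m + a[:, None] + b[None, :] + np.outer(c, d)
```

Because the residual `R` is double-centred, the singular vectors associated with nonzero singular values automatically satisfy the sum-to-zero identification restrictions.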

By adding one or more sets of clusters to the effects on the right-hand side of Eq. (1), and by decomposing all matrix slices at the same time, the lsbclust method ensures that the estimated effects are equal for observations in the same cluster. The choice of bilinear decomposition has two interesting features. First, it allows the results to be displayed graphically using biplots (Gower et al. 2011; Gower and Hand 1996; Gabriel 1971), and other standard graphs. Second, it provides the possibility to have a different set of clusters for each of the four types of effects. This can be useful in cases where interesting differences can be expected not only with respect to the row–column interactions, but also with respect to the grand mean and main effects on the right-hand side of (1). We will discuss this aspect further in Sect. 3.

The main contribution of this paper is the introduction of the lsbclust model and loss function, together with an algorithm for estimating the parameters. We also provide an implementation of this algorithm and its related graphical procedures in an R (R Core Team 2020) package lsbclust (available on the Comprehensive R Archive Network, or CRAN), while supplemental materials illustrate its use. The method can be used as a complement to or replacement of other three-way data analysis procedures.

The remainder of the paper is structured as follows. Section 2 introduces the basic model and loss function used, which Sect. 3 augments with clustering to arrive at the lsbclust formulation. Section 4 shows how to simplify the loss function, and Sect. 5 discusses our algorithm for estimating the parameters. Enhancements to the basic method are discussed in Sect. 6, including using different bilinear models and aspects of biplot construction. The results of a simulation study are reported in Sect. 7. An empirical example is presented in Sect. 8, and Sect. 9 concludes.

Before introducing lsbclust in more detail, we first review related methods in Sect. 1.1.

1.1 Related methods

The most prominent multiway data analysis methods, which include the Tucker3 and CP decompositions, seek to generalize the singular value decomposition (SVD) of a matrix to the multiway case. Kolda and Bader (2009) provide an excellent review of these tensor decomposition methods and their applications across diverse fields. De Silva and Lim (2008) discuss mathematical aspects of generalizing the Eckart–Young theorem (Stewart 1993; Eckart and Young 1936; Schmidt 1907) to higher-order tensors, while Kiers and Van Mechelen (2001) discuss and illustrate the application of the Tucker3 decomposition.

Three-mode factor analysis (Tucker 1966; Kroonenberg and de Leeuw 1980), commonly referred to as Tucker3, is the most general widely-used three-way method. Let \(\mathbf {X}_{J,KI} = \begin{bmatrix} \mathbf {X}_{1}&\mathbf {X}_2&\cdots&\mathbf {X}_{I} \end{bmatrix}\), and let the component matrices \(\mathbf {B}\), \(\mathbf {C}\) and \(\mathbf {D}\) be low-dimensional columnwise orthonormal configurations for the first, second and third ways of \(\underline{\mathbf {X}}\) with dimensions P, Q and R respectively. Also, let \(\underline{\mathbf {H}}: P \times Q \times R\) be the so-called core array, which gives the interactions between the elements of the component matrices. The Tucker3 model has loss function

$$\begin{aligned} L_{\textsc {T3}}(\mathbf {B}, \mathbf {C}, \mathbf {D}, \underline{\mathbf {H}})&= \left\| \mathbf {X}_{J,KI} - \mathbf {C} \mathbf {H}_{Q,RP}(\mathbf {B} \otimes \mathbf {D})^{'} \right\| ^{2} \nonumber \\&= \sum _{i=1}^{I} \left\| \mathbf {X}_{i} - \sum _{p=1}^{P} b_{ip} \mathbf {C} \mathbf {H}_{p} \mathbf {D}^{'} \right\| ^{2}, \end{aligned}$$

where \(b_{ip}\) denotes an element of \(\mathbf {B}\), \(\mathbf {H}_p: Q \times R\) is a slice of \(\underline{\mathbf {H}}\) along the first mode, \(\Vert \cdot \Vert \) is the Frobenius norm and \( \otimes \) is the Kronecker product. Interpreting a Tucker3 solution involves examining the three component matrices and the core. No clustering is performed, and (2) is typically minimized by alternating least squares (ALS; Kroonenberg and de Leeuw 1980).
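The two forms of the Tucker3 loss in (2) are equivalent, which can be checked numerically. The numpy sketch below (arbitrary random components, not a fitted model; names hypothetical) evaluates the slice-wise form:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K = 5, 4, 3                # array dimensions
P, Q, R = 2, 2, 2                # component ranks for the three ways
X = rng.normal(size=(I, J, K))   # X[i] is the slice X_i (J x K)

B = rng.normal(size=(I, P))      # first-way components
C = rng.normal(size=(J, Q))      # second-way components
D = rng.normal(size=(K, R))      # third-way components
H = rng.normal(size=(P, Q, R))   # core array

# Slice-wise Tucker3 loss: sum_i || X_i - sum_p b_ip C H_p D' ||^2
loss = sum(
    np.linalg.norm(X[i] - sum(B[i, p] * C @ H[p] @ D.T for p in range(P)))**2
    for i in range(I)
)
```

The same quantity can be obtained from the matricized form via a single tensor contraction, e.g. `np.einsum('ip,jq,kr,pqr->ijk', B, C, D, H)`.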

The CP decomposition (Carroll and Chang 1970; Harshman 1970) is a restricted version of Tucker3 where \(P = Q = R\) and \(\underline{\mathbf {H}}\) is replaced by \(\underline{\mathbf {E}}\). Here \(\underline{\mathbf {E}}\) contains elements \(e_{pqr} = 1\) if \(p = q = r\) and \(e_{pqr} = 0\) otherwise. This implies that each component (column) in \(\mathbf {B}\) is related to a single component in \(\mathbf {C}\) and \(\mathbf {D}\), and vice versa, while in Tucker3 all components in one mode are related to all other components in the other modes. The CP loss function is

$$\begin{aligned} L_{\textsc {CP}}(\mathbf {B}, \mathbf {C}, \mathbf {D})&= \left\| \mathbf {X}_{J,KI} - \mathbf {C} \mathbf {E}_{P,PP}(\mathbf {B} \otimes \mathbf {D})^{'} \right\| ^{2} \nonumber \\&= \sum _{i=1}^{I} \left\| \mathbf {X}_{i} - \mathbf {C} {{\,\mathrm{diag}\,}}(\varvec{b}_{i}) \mathbf {D}^{'} \right\| ^{2}. \end{aligned}$$

Here \({{\,\mathrm{diag}\,}}(\varvec{b}_{i})\) is the diagonal matrix with the ith row of \(\mathbf {B}\) on the diagonal, and \(\mathbf {E}_{P,PP} = \begin{bmatrix} \mathbf {E}_1&\mathbf {E}_2&\cdots&\mathbf {E}_{P} \end{bmatrix}\) with \(\mathbf {E}_p\) being the pth matrix slice of \(\underline{\mathbf {E}}\).
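The slice-wise CP loss in (4) maps directly to code. A numpy sketch (random components, names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, P = 5, 4, 3, 2
X = rng.normal(size=(I, J, K))
B = rng.normal(size=(I, P))
C = rng.normal(size=(J, P))
D = rng.normal(size=(K, P))

# CP loss: sum_i || X_i - C diag(b_i) D' ||^2
loss = sum(np.linalg.norm(X[i] - C @ np.diag(B[i]) @ D.T)**2 for i in range(I))
```

Note that all three component matrices share the same number of columns \(P\), reflecting the restriction \(P = Q = R\).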

Neither of these methods employs clustering along any of the modes, but Rocci and Vichi (2005) formulate such a variant of Tucker3, namely T3clus. The idea is to replace \(\mathbf {B}\) by the indicator matrix \(\mathbf {G}: I \times U\), which simply indicates which of U clusters each of the I observations belongs to. The core array \(\underline{\mathbf {H}}\) now represents the three-way array of cluster centroids in the reduced component space. The T3clus loss function is

$$\begin{aligned} L_{\textsc {T3clus}}(\mathbf {G}, \mathbf {C}, \mathbf {D}, \underline{\mathbf {H}})&= \sum _{i=1}^{I} \sum _{u=1}^{U} g_{iu} \left\| \mathbf {X}_{i} - \mathbf {C} \mathbf {H}_{u} \mathbf {D}^{'} \right\| ^{2}. \end{aligned}$$

Another related approach is the clusterwise CP method of Wilderjans and Ceulemans (2013)—CPclus hereafter. The loss function for this approach is

$$\begin{aligned} L_{\textsc {CPclus}}(\mathbf {G}, \mathbf {B}, \{\mathbf {C}_{u}\}_{u=1}^{U}, \{\mathbf {D}_u\}_{u=1}^{U})&= \sum _{i=1}^{I} \sum _{u=1}^{U} g_{iu} \left\| \mathbf {X}_{i} - \mathbf {C}_u^{} {{\,\mathrm{diag}\,}}(\varvec{b}_{i}) \mathbf {D}_{u}^{'} \right\| ^{2}. \end{aligned}$$

In contrast to T3clus, here the component matrices are cluster-specific.
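The structural difference between (5) and the T3clus loss (4) is easiest to see side by side in code. A numpy sketch (random parameter values, with cluster memberships stored as labels rather than an indicator matrix; names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
I, Jn, Kn, Q, R, U, P = 6, 4, 3, 2, 2, 2, 2
X = rng.normal(size=(I, Jn, Kn))
g = rng.integers(0, U, size=I)        # hard cluster memberships (G as labels)

# T3clus: shared component matrices C, D; cluster-specific core slices H_u
C = rng.normal(size=(Jn, Q))
D = rng.normal(size=(Kn, R))
H = rng.normal(size=(U, Q, R))
t3clus = sum(np.linalg.norm(X[i] - C @ H[g[i]] @ D.T)**2 for i in range(I))

# CPclus: cluster-specific components C_u, D_u; observation-specific weights b_i
B = rng.normal(size=(I, P))
Cu = rng.normal(size=(U, Jn, P))
Du = rng.normal(size=(U, Kn, P))
cpclus = sum(np.linalg.norm(X[i] - Cu[g[i]] @ np.diag(B[i]) @ Du[g[i]].T)**2
             for i in range(I))
```

In T3clus only the core varies across clusters; in CPclus the component matrices themselves do.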

Table 1 A summary of loss functions for the three-way methods discussed in Sect. 1.1

Table 1 gives a summary of the loss functions of these three-way decompositions, together with the lsbclust loss function for the interaction effects (where \(\mathbf {J}\) denotes a centring matrix of the appropriate size). From the table it is clear that each method approximates the matrix slices \(\mathbf {X}_{i} \, (i = 1, \ldots , I)\) in a different way. Tucker3 and Candecomp/Parafac are symmetric with respect to all three modes, but do not involve clustering. When all \(\mathbf {X}_{i}\) are double-centred, T3clus approximates \(\mathbf {X}_{i}\) in cluster u by \(\mathbf {C} \mathbf {H}_{u} \mathbf {D}^{'}\), as compared to \(\mathbf {C}_{u}^{} \mathbf {D}_{u}^{'}\) for lsbclust. T3clus therefore requires the clusterwise approximations of \(\mathbf {X}_i\) to lie in the same row and column subspaces across clusters, while lsbclust imposes no such restriction. The lsbclust loss function is a constrained version of CPclus in which \({{\,\mathrm{diag}\,}}(\varvec{b}_{i})\) is replaced with the identity matrix. This reduces the number of parameters and simultaneously enforces the same interpretation for all members of the same cluster. The additional constraint means that a single biplot can be used to interpret the interaction effects in a cluster, instead of requiring one plot per cluster member.

In the special case where \(\underline{\mathbf {X}}\) contains indicator matrices for each subject, lsbclust has a strong connection to the latent-class bilinear multinomial logit (lc-bml) model of van Rosmalen et al. (2010). Their model was developed specifically to deal with response styles when analyzing two-way self-report survey data. More details on the lc-bml model, and its connection to lsbclust, are given in the supplemental materials.

The next section discusses the basic building blocks of lsbclust.

2 Basic model and loss function

Consider real-valued data collected in a three-way array \(\underline{\mathbf {X}}\). For the marketing research survey example, the entry \(x_{ijk}\) of \(\underline{\mathbf {X}}\) is the rating by person i of product j on characteristic k, with \(i = 1, \ldots , I\), \(j = 1, \ldots , J\) and \(k = 1, \ldots , K\). Let \(\mathbf {X}_{i}\) denote the \(J \times K\) matrix of responses for subject i, which is also the ith (frontal) slice of the array \(\underline{\mathbf {X}}\). By convention, we single out the first way of \(\underline{\mathbf {X}}\) by treating the \(\mathbf {X}_i\) as observations, but the method could also be applied to the second or third way of \(\underline{\mathbf {X}}\).

We derive the proposed lsbclust formulation by augmenting (with clusters; Sect. 3) the following least-squares loss function:

$$\begin{aligned} L(m,\varvec{a},\varvec{b},\mathbf {C,D}) = \sum _{i=1}^I\left\| \mathbf {X}_{i} - \left( m\varvec{1}_{J}^{}\varvec{1}_{K}^{'} + \varvec{a}\varvec{1}_{K}^{'} + \varvec{1}_{J}^{} \varvec{b}^{'} + \mathbf {C}\mathbf {D}^{'}\right) \right\| ^2. \end{aligned}$$

Here \(\Vert \cdot \Vert \) is the Frobenius norm. Moreover, \(\varvec{1}_{K}^{}\) is the length-K vector of ones; below we will drop the subscript—as in \(\varvec{1}\)—for simplicity. Equation (6) approximates each \(\mathbf {X}_i\) using the same bilinear (or biadditive) model. This model contains a grand mean m, the row and column main effects \(\varvec{a}\) and \(\varvec{b}\) respectively, and row–column interactions \(\mathbf {C} \mathbf {D}^{'}\).

The matrices \(\mathbf {C}\) and \(\mathbf {D}\) are low-dimensional representations of the rows (products) and columns (characteristics) of the \(\mathbf {X}_i\) matrices respectively, after adjusting for main effects. Representing the interaction effects through such inner products permits displaying them in biplots, provided the dimensionality of \(\mathbf {C}\) and \(\mathbf {D}\) is low enough (Gower et al. 2011; Gower and Hand 1996; Gabriel 1971). Therefore, the dimensionality is typically set to two, although values up to \(\min \{J,K\} - 1\) are possible. To ensure uniqueness of the model, the usual sum-to-zero constraints \(\varvec{a}^{'} \varvec{1}_{}^{} = \varvec{b}^{'}\varvec{1}_{}^{} =0\) and \(\varvec{1}_{}^{'}\mathbf {C} = \varvec{1}_{}^{'}\mathbf {D}= \varvec{0}\) must be imposed. Additionally, the columns of \(\mathbf {C}\) and \(\mathbf {D}\) are required to be orthogonal (for more information, see Denis and Gower 1994, as well as Sect. 6.3). Model (6) has an analytical solution.
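The analytical solution follows because \(\sum _i \Vert \mathbf {X}_i - \mathbf {M} \Vert ^2 = \sum _i \Vert \mathbf {X}_i - \bar{\mathbf {X}} \Vert ^2 + I \Vert \bar{\mathbf {X}} - \mathbf {M} \Vert ^2\) for any common approximation \(\mathbf {M}\), so it suffices to fit the bilinear model to the average slice. A numpy sketch (assuming the symmetric split of singular values between \(\mathbf {C}\) and \(\mathbf {D}\); names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, K, P = 10, 5, 4, 2
X = rng.normal(size=(I, J, K))

Xbar = X.mean(axis=0)                        # minimizing (6) reduces to fitting X-bar
m = Xbar.mean()                              # grand mean
a = Xbar.mean(axis=1) - m                    # row main effects
b = Xbar.mean(axis=0) - m                    # column main effects
Rmat = Xbar - (m + a[:, None] + b[None, :])  # double-centred part

# Rank-P interaction term from the truncated SVD
U, s, Vt = np.linalg.svd(Rmat)
C = U[:, :P] * np.sqrt(s[:P])                # symmetric split of singular values
D = Vt[:P].T * np.sqrt(s[:P])
```

The sum-to-zero and orthogonality constraints are then satisfied automatically by the construction.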

3 Capturing heterogeneity with clusters

Equation (6) applies the same parameters to the entire data set. To capture potential heterogeneity, we allow for the existence of a prespecified number of clusters in a single mode between which the parameters in the bilinear decomposition of the \(\mathbf {X}_i\) matrices may vary. Moreover, since the bilinear decomposition itself has four different sets of parameters, we introduce four sets of clusters—one for each type of parameter. This allows the model to recognize that, for example, although \(\mathbf {X}_i\) and \(\mathbf {X}_{i'}\) (\(i \ne i'\)) are similar with respect to main effects, they could differ in the interaction effects (or vice versa).

Let \(\mathbf {G}^{(\mathrm {o})}\) be the \(I \times R\) matrix of cluster memberships for the grand mean effect m, which has \(g_{ir}^{(\mathrm {o})} = 1\) if person i belongs to cluster r and \(g_{ir}^{(\mathrm {o})} = 0\) otherwise (\(r = 1, 2, \ldots , R\)). Similarly, \(\mathbf {G}^{(\mathrm {r})}\) is the \(I \times S\) matrix of cluster memberships for the row effects, \(\mathbf {G}^{(\mathrm {c})}\) the \(I \times T\) matrix of cluster memberships for the column effects, and \(\mathbf {G}^{(\mathrm {i})}\) the \(I \times U\) matrix of cluster memberships for the interaction effects. Now, by incorporating the clustering, the least-squares loss function becomes

$$\begin{aligned}&L(\mathbf {G}^{(\mathrm {o})}, \mathbf {G}^{(\mathrm {r})}, \mathbf {G}^{(\mathrm {c})}, \mathbf {G}^{(\mathrm {i})}, \varvec{m}, \mathbf {A},\mathbf {B}, \mathbf {C}, \mathbf {D}) \nonumber \\&\quad = \sum _{i,r,s,t,u} g_{ir}^{(\mathrm {o})} g_{is}^{(\mathrm {r})} g_{it}^{(\mathrm {c})} g_{iu}^{(\mathrm {i})} \left\| {\mathbf {X}_i - \left( m_{r}^{}\varvec{1}_{}^{}\varvec{1}_{}^{'} + \varvec{a}_{s}^{}\varvec{1}_{}^{'} + \varvec{1}_{}^{} \varvec{b}_{t}^{'} + \mathbf {C}_{u}^{}\mathbf {D}_{u}^{'}\right) }\right\| ^2 \nonumber \\&\quad = \sum _{i,r,s,t,u} g_{ir}^{(\mathrm {o})} g_{is}^{(\mathrm {r})} g_{it}^{(\mathrm {c})} g_{iu}^{(\mathrm {i})} L(i|r,s,t,u), \end{aligned}$$

with the summation over all observations (\(I\)) and clusters (\(R\), \(S\), \(T\) and \(U\)). Here \(\varvec{m}^{'} = [m_1 \; \cdots \; m_R]\), \(\mathbf {A}= [\varvec{a}_1 \; \cdots \;\varvec{a}_S ]\), \(\mathbf {B}= [\varvec{b}_1 \; \cdots \;\varvec{b}_T ]\), \(\mathbf {C}_{}^{'} = [\mathbf {C}_1^{'} \; \cdots \; \mathbf {C}_U^{'}]\) and \(\mathbf {D}_{}^{'} = [\mathbf {D}_1^{'} \; \cdots \; \mathbf {D}_U^{'}]\). Note that we implicitly assume that all low-rank decompositions have the same rank P across clusters. For identifiability, the same sum-to-zero constraints now apply to each cluster-specific set of parameters, i.e., \(\mathbf {D}_{u}^{'} \varvec{1}_{}^{} = \varvec{0}\) for all \( u = 1, \ldots , U\). The columns of all \(\mathbf {C}_u\) and \(\mathbf {D}_u\) matrices are orthogonal.

Equation (7) allows for a total of RSTU clusters by clustering each \(\mathbf {X}_i\) on four different types of effects at the same time. However, we show next that the joint clustering problem can be simplified into four separate clustering problems, significantly reducing the computational complexity since only \(R + S + T + U\) clusters will need to be found.

4 Decomposing the loss function

To simplify the exposition, note that mathematically we can drop the sum-to-zero constraints from the formulation by introducing centering matrices of the form

$$\begin{aligned} \mathbf {J}_{c}^{} = \mathbf {I}_c - \frac{1}{c} \varvec{1}_{c}^{} \varvec{1}^{'}_{c}, \end{aligned}$$

for some positive integer c controlling the size of the matrix; below we omit this subscript for simplicity. This is done by redefining the terms in the summation in (7) as

$$\begin{aligned} L(i|r,s,t,u)&= \left\| \mathbf {X}_i - \left( m_r\varvec{1}_{}^{} \varvec{1}_{}^{'} + \mathbf {J}_{}^{}\varvec{a}_{s}^{}\varvec{1}_{}^{'} + \varvec{1}_{}^{} \varvec{b}_{t}^{'}\mathbf {J}_{}^{} + \mathbf {J}_{}^{} \mathbf {C}_{u}^{}\mathbf {D}_{u}^{'}\mathbf {J}_{}^{}\right) \right\| ^2, \end{aligned}$$

such that the sum-to-zero constraints on the columns of \(\mathbf {C}_u\) and \(\mathbf {D}_u\), and on \(\varvec{a}_s\) and \(\varvec{b}_t\), are automatically enforced. For example, estimating the parameters in \(\mathbf {J}_{}^{} \varvec{a}_s^{}\) is equivalent to estimating \(\varvec{a}_s^{}\) subject to \(\varvec{1}_{}^{'} \varvec{a}_{s}^{} = 0\). To simplify the notation, we redefine \(\mathbf {A}= [\mathbf {J}_{}^{} \varvec{a}_1 \; \cdots \;\mathbf {J}_{}^{}\varvec{a}_S ]\) to avoid writing \(\mathbf {J}\mathbf {A}\). The matrices \(\mathbf {B}\), \(\mathbf {C}_{}^{}\) and \(\mathbf {D}_{}^{}\) are also redefined analogously.

We proceed to simplify (9) by expanding \(\mathbf {X}_{i}\) into separate terms. Combining the double-centred \(\mathbf {J}_{}^{} \mathbf {X}_{i}^{} \mathbf {J}_{}^{}\) with two additional averaging operators for the row and column means, \(\mathbf {X}_{i}\) can be rewritten as

$$\begin{aligned} \mathbf {X}_i = \frac{\varvec{1}_{}^{'}\mathbf {X}_i^{} \varvec{1}_{}^{}}{JK}\varvec{1}_{}^{}\varvec{1}_{}^{'} + \frac{1}{J}\varvec{1}_{}^{}\varvec{1}_{}^{'}\mathbf {X}_{i}^{} \mathbf {J}_{}^{} + \frac{1}{K}\mathbf {J}_{}^{}\mathbf {X}_{i}^{}\varvec{1}_{}^{}\varvec{1}_{}^{'} + \mathbf {J}_{}^{} \mathbf {X}_i^{}\mathbf {J}_{}^{}. \end{aligned}$$
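The identity (10), and the pairwise orthogonality of its four terms, can be verified numerically. A small numpy check (matrix names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
Jn, Kn = 4, 3
X = rng.normal(size=(Jn, Kn))
Jr = np.eye(Jn) - np.ones((Jn, Jn)) / Jn   # row-centering matrix (J x J)
Jc = np.eye(Kn) - np.ones((Kn, Kn)) / Kn   # column-centering matrix (K x K)
one_J = np.ones((Jn, 1))
one_K = np.ones((Kn, 1))

t1 = (one_J.T @ X @ one_K) / (Jn * Kn) * (one_J @ one_K.T)  # grand-mean term
t2 = one_J @ (one_J.T @ X @ Jc) / Jn                        # column-margin term
t3 = Jr @ (X @ one_K) @ one_K.T / Kn                        # row-margin term
t4 = Jr @ X @ Jc                                            # double-centred term
```

The four terms sum to `X` exactly, and every pair has a zero Frobenius inner product.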

We can now associate each of the terms in (10) with the corresponding terms in the model (9):

$$\begin{aligned} L(i|r,s,t,u) =&\left\| \left( \tfrac{1}{JK} \varvec{1}_{}^{'}\mathbf {X}_i^{} \varvec{1} - m_r\right) \varvec{1}\varvec{1}_{}^{'} \right. + \varvec{1}_{}^{} \left( \tfrac{1}{J}\varvec{1}_{}^{'}\mathbf {X}_{i}^{} - \varvec{b}_{t}^{'}\right) \mathbf {J}_{}^{} \nonumber \\&{}+ \mathbf {J} \left( \tfrac{1}{K}\mathbf {X}_{i}^{}\varvec{1} - \varvec{a}_{s}^{}\right) \varvec{1}_{}^{'} + \mathbf {J} \left( \mathbf {X}_i^{} - \mathbf {C}_{u}^{} \mathbf {D}_{u}^{'}\right) \mathbf {J} \left. \!\right\| ^2. \end{aligned}$$

It can be shown (see the appendix in the supplemental materials) that the decomposition (11) is orthogonal such that

$$\begin{aligned} L(i|r,s,t,u)&= JK \left\| \left( \tfrac{1}{JK} \varvec{1}_{}^{'}\mathbf {X}_i^{} \varvec{1}_{}^{} - m_r\right) \right\| ^2 + J\left\| \left( \tfrac{1}{J}\varvec{1}_{}^{'}\mathbf {X}_{i}^{} - \varvec{b}_{t}^{'}\right) \mathbf {J}_{}^{} \right\| ^{2} \nonumber \\&\quad + K\left\| \mathbf {J}_{}^{} \left( \tfrac{1}{K}\mathbf {X}_{i}^{}\varvec{1}_{}^{} - \varvec{a}_s\right) \right\| ^{2} + \left\| \mathbf {J}_{}^{} (\mathbf {X}_i^{} - \mathbf {C}_{u}^{} \mathbf {D}_{u}^{'})\mathbf {J}_{}^{} \right\| ^2 \nonumber \\&= L_{\mathrm {(o)}} (i|r) + L_{\mathrm {(r)}} (i|s) + L_{\mathrm {(c)}} (i|t) + L_{\mathrm {(i)}} (i|u). \end{aligned}$$

This equality follows from the fact that all the cross-products are zero. Importantly, the orthogonality leads to a great simplification in the clustering, since now the loss function (7) equals

$$\begin{aligned}&L(\mathbf {G}^{(\mathrm {o})}, \mathbf {G}^{(\mathrm {r})}, \mathbf {G}^{(\mathrm {c})}, \mathbf {G}^{(\mathrm {i})}, \varvec{m}, \mathbf {A},\mathbf {B}, \mathbf {C}, \mathbf {D}) \nonumber \\&\quad = JK \sum _{i=1}^{I} \sum _{r=1}^{R} g_{ir}^{(\mathrm {o})} \left\| \left( \tfrac{1}{JK} \varvec{1}_{}^{'}\mathbf {X}_i^{} \varvec{1}_{}^{} - m_r\right) \right\| ^2 \nonumber \\&\qquad + K \sum _{i=1}^{I} \sum _{s=1}^{S} g_{is}^{(\mathrm {r})} \left\| \mathbf {J} \left( \tfrac{1}{K}\mathbf {X}_{i}^{}\varvec{1} - \varvec{a}_s\right) \right\| ^{2} \nonumber \\&\qquad + J \sum _{i=1}^{I} \sum _{t=1}^{T} g_{it}^{(\mathrm {c})} \left\| \left( \tfrac{1}{J}\varvec{1}_{J}^{'}\mathbf {X}_{i}^{} - \varvec{b}_{t}^{'}\right) \mathbf {J}_{}^{} \right\| ^{2} \nonumber \\&\qquad + \sum _{i=1}^{I} \sum _{u = 1}^{U} g_{iu}^{(\mathrm {i})} \left\| \mathbf {J}_{}^{} \left( \mathbf {X}_i^{} - \mathbf {C}_{u}^{} \mathbf {D}_{u}^{'}\right) \mathbf {J}_{}^{} \right\| ^2 \nonumber \\&\quad = L_{(\mathrm {o})}\big (\mathbf {G}^{(\mathrm {o})}, \varvec{m}\big ) + L_{(\mathrm {r})}\big (\mathbf {G}^{(\mathrm {r})}, \mathbf {A}\big ) + L_{(\mathrm {c})}\big (\mathbf {G}^{(\mathrm {c})}, \mathbf {B}\big ) + L_{(\mathrm {i})}\big (\mathbf {G}^{(\mathrm {i})}, \mathbf {C}, \mathbf {D}\big ). \end{aligned}$$

The main consequence of (13) is that the joint clustering reduces to separate clusterings on the grand means, row main effects, column main effects and interactions respectively. This implies that each of these four parts can be treated independently under this loss function and model. It also gives mathematical justification for the procedure whereby all \(\mathbf {X}_{i}\) are first double-centred to remove the overall, row and column margins, and then analyzed by minimizing \(L_{(\mathrm {i})}\big (\mathbf {G}^{(\mathrm {i})}, \mathbf {C}, \mathbf {D}\big )\) to study the row–column interactions. If the researcher is interested in the grand mean, row or column marginal effects, these can be analyzed separately by minimizing the corresponding loss functions. These effects are frequently omitted when only the row–column interactions are of interest; alternatively, the researcher may prefer to jointly cluster on more than one of these effects at the same time (see Sect. 6.1).
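The orthogonal split of the loss can also be checked numerically: for arbitrary centred parameter values, the per-observation loss computed directly from the bilinear model in (9) equals the sum of the four component losses in (12). A numpy sketch (random parameter values, names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
Jn, Kn, P = 4, 3, 2
X = rng.normal(size=(Jn, Kn))           # one slice X_i
Jr = np.eye(Jn) - 1 / Jn                # J x J centering matrix I - (1/J)11'
Jc = np.eye(Kn) - 1 / Kn                # K x K centering matrix I - (1/K)11'

# Arbitrary centred cluster parameters m_r, a_s, b_t, C_u, D_u
m = rng.normal()
a = Jr @ rng.normal(size=Jn)
b = Jc @ rng.normal(size=Kn)
C = Jr @ rng.normal(size=(Jn, P))
D = Jc @ rng.normal(size=(Kn, P))

# Full loss L(i|r,s,t,u) from (9) ...
full = np.linalg.norm(X - (m + a[:, None] + b[None, :] + C @ D.T))**2

# ... equals the sum of the four orthogonal parts in (12)
L_o = Jn * Kn * (X.mean() - m)**2
L_r = Kn * np.linalg.norm(Jr @ (X.mean(axis=1) - a))**2
L_c = Jn * np.linalg.norm((X.mean(axis=0) - b) @ Jc)**2
L_i = np.linalg.norm(Jr @ (X - C @ D.T) @ Jc)**2
```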

Next, we discuss computational aspects of the proposed lsbclust method.

5 Estimation algorithms

Due to the form of the loss function (13), we can treat each of the components separately. Conveniently, the loss functions \(L_{(\mathrm {o})}\big (\mathbf {G}^{(\mathrm {o})}, \varvec{m}\big )\), \(L_{(\mathrm {r})}\left( \mathbf {G}^{(\mathrm {r})}, \mathbf {A}\right) \) and \(L_{(\mathrm {c})}\left( \mathbf {G}^{(\mathrm {c})}, \mathbf {B}\right) \) are specific instances of the well-known k-means loss function (e.g., Everitt et al. 2011). They differ only with respect to the data matrix \(\mathbf {Y}: I \times d\) (say) on which k-means cluster analysis is to be applied, which can be defined as follows:

  • For minimizing \(L_{(\mathrm {o})}\big (\mathbf {G}^{(\mathrm {o})}, \varvec{m}\big )\), \(\mathbf {Y}\) has a single column (\(d=1\)) containing the overall means \(\tfrac{1}{JK}\varvec{1}_{}^{'}\mathbf {X}_i^{} \varvec{1}\) of the \(\mathbf {X}_{i}\) (\(i = 1, \ldots , I\));

  • For minimizing \(L_{(\mathrm {r})}\left( \mathbf {G}^{(\mathrm {r})}, \mathbf {A}\right) \), the rows of \(\mathbf {Y}\) (\(d = J\)) consist of the row mean vectors \(\tfrac{1}{K} \mathbf {J} \mathbf {X}_{i}^{}\varvec{1}\) (\(i = 1, \ldots , I\)); and

  • For minimizing \(L_{(\mathrm {c})}\left( \mathbf {G}^{(\mathrm {c})}, \mathbf {B}\right) \), the rows of \(\mathbf {Y}\) (\(d = K\)) are the column mean vectors \(\tfrac{1}{J} \mathbf {J} \mathbf {X}_{i}^{'} \varvec{1}\) (\(i = 1, \ldots , I\)).
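The three \(\mathbf {Y}\) matrices above can be constructed directly. In the numpy sketch below, the `kmeans` function is a plain Lloyd's algorithm included for illustration only, not the implementation used in the lsbclust package (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
I, Jn, Kn = 8, 5, 4
X = rng.normal(size=(I, Jn, Kn))
Jr = np.eye(Jn) - 1 / Jn                         # J x J centering matrix
Jc = np.eye(Kn) - 1 / Kn                         # K x K centering matrix

Y_overall = X.mean(axis=(1, 2)).reshape(I, 1)    # d = 1: overall means
Y_rows = X.mean(axis=2) @ Jr                     # d = J: centred row means
Y_cols = X.mean(axis=1) @ Jc                     # d = K: centred column means

def kmeans(Y, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns labels and centroids."""
    r = np.random.default_rng(seed)
    cent = Y[r.choice(len(Y), size=k, replace=False)]
    for _ in range(iters):
        lab = ((Y[:, None, :] - cent[None, :, :])**2).sum(-1).argmin(1)
        for j in range(k):
            if (lab == j).any():
                cent[j] = Y[lab == j].mean(0)
    return lab, cent

labels, _ = kmeans(Y_rows, k=2)                  # cluster the row main effects
```

In practice any standard k-means routine with multiple random starts can be substituted for the toy function above.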

Hence, minimizing \(L_{(\mathrm {o})}\big (\mathbf {G}^{(\mathrm {o})}, \varvec{m}\big )\), \(L_{(\mathrm {r})}\big (\mathbf {G}^{(\mathrm {r})}, \mathbf {A}\big )\) and \(L_{(\mathrm {c})}\big (\mathbf {G}^{(\mathrm {c})}, \mathbf {B}\big )\) reduces to applying standard k-means methods to the overall means, row margins and column margins respectively. Also, a variety of tools are available for selecting \(R\), \(S\) and \(T\).

[Algorithm 1: block-relaxation (ALS) algorithm for minimizing \(L_{(\mathrm {i})}\big (\mathbf {G}^{(\mathrm {i})}, \mathbf {C}, \mathbf {D}\big )\)]

Minimizing \(L_{(\mathrm {i})}\big (\mathbf {G}^{(\mathrm {i})}, \mathbf {C}, \mathbf {D}\big )\) however requires a custom algorithm. The main steps of our algorithm, which is based on block-relaxation methods (see, for example, de Leeuw 1994), are outlined in Algorithm 1. It iterates over optimizing \(\mathbf {C}\) and \(\mathbf {D}\) in Step 2b while keeping \(\mathbf {G}^{(\mathrm {i})}\) fixed, and vice versa in Step 2c. This approach is also known as alternating least squares (ALS).

Convergence is reached when no reassignment of a single observation to a different cluster will reduce the value of the loss function; this corresponds to \(\Delta _k = 0\) in Algorithm 1. The algorithm is guaranteed to converge monotonically, but only to an accumulation point, which is usually a local minimum. It must be initialized with a starting configuration for \(\mathbf {G}^{(\mathrm {i})}\). To increase the likelihood of locating the global minimum, it is advisable to use multiple (random) starting values for \(\mathbf {G}^{(\mathrm {i})}\).

We now describe the derivation of the key steps of Algorithm 1. Step 2b, where \(L_{(\mathrm {i})}\big (\mathbf {G}^{(\mathrm {i})}, \mathbf {C}, \mathbf {D}\big )\) is minimized over \(\mathbf {C}\) and \(\mathbf {D}\) for fixed \(\mathbf {G}^{(\mathrm {i})}\), relies on the decomposition

$$\begin{aligned}&L_{(\mathrm {i})} \big (\mathbf {G}^{(\mathrm {i})}, \mathbf {C}, \mathbf {D}\big ) \nonumber \\&\quad = \sum _{i=1}^{I} \sum _{u = 1}^{U} g_{iu}^{(\mathrm {i})} \left\| \mathbf {J} \left( \mathbf {X}_i^{} - \mathbf {{\overline{X}}}_u^{} \right) \mathbf {J} \right\| ^2 + \sum _{u=1}^{U} I_{u} \left\| \mathbf {J} \left( \mathbf {{\overline{X}}}_u^{} - \mathbf {C}_{u}^{} \mathbf {D}_{u}^{'} \right) \mathbf {J} \right\| ^2. \end{aligned}$$

Here \(I_{u}^{} = \sum _{i=1}^{I} g_{iu}^{(\mathrm {i})}\) is the cardinality of cluster u, and \(\mathbf {{\overline{X}}}_{u}^{} = \tfrac{1}{I_{u}^{}} \sum _{i=1}^{I} g_{iu}^{(\mathrm {i})} \mathbf {X}_{i}^{}\) is the cluster mean.

Since only the final term in (14) depends on \(\mathbf {C}\) and \(\mathbf {D}\), it is sufficient to minimize this term only. Let the singular value decomposition (SVD) of \(\mathbf {J} \mathbf {{\overline{X}}}_u^{} \mathbf {J}\) be \(\mathbf {U}_{u} \varvec{\Gamma }_{u} \mathbf {V}_{u}^{'} \), where \(\mathbf {U}_u \) and \(\mathbf {V}_u\) are orthonormal and \(\varvec{\Gamma }_u\) diagonal. The Eckart–Young theorem (Eckart and Young 1936; Schmidt 1907) establishes the best rank-P least-squares approximation of \(\mathbf {J} \mathbf {{\overline{X}}}_u^{} \mathbf {J}\) as the truncated SVD \( \mathbf {U}_{u} \varvec{\Gamma }_{u} \mathbf {L}_{P}^{} \mathbf {V}_{u}^{'}\), where

$$\begin{aligned} \mathbf {L}_{P} = \begin{bmatrix} \mathbf {I}_{P} &{}\quad \mathbf {0} \\ \mathbf {0} &{}\quad \mathbf {0} \end{bmatrix}. \end{aligned}$$

Multiplication by \(\mathbf {L}_{P}\) sets all singular values except the first P equal to zero. Consequently, we can update \(\mathbf {C}\) and \(\mathbf {D}\) using

$$\begin{aligned} \mathbf {J} \mathbf {C}_{u}^{}&= \mathbf {U}_{u}^{} \varvec{\Gamma }_{u}^{\alpha } \mathbf {L}_P^{} \nonumber \\ \mathbf {J} \mathbf {D}_{u}^{}&= \mathbf {V}_{u}^{} \varvec{\Gamma }_{u}^{1 - \alpha } \mathbf {L}_P^{}, \end{aligned}$$

where \(\varvec{\Gamma }^{\alpha }\) denotes the diagonal matrix containing the singular values to the power \(\alpha \). The parameter \(0 \le \alpha \le 1\) is typically taken to be 0.5, but can be set by the user to improve the interpretability of the graphical output. See Sect. 6.3 for a description of the heuristic rule used in our implementation.
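Step 2b thus reduces to one truncated SVD per cluster. A numpy sketch of the update (16) for a single cluster mean, with \(\alpha = 0.5\) (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(8)
Jn, Kn, P, alpha = 5, 4, 2, 0.5
Xbar_u = rng.normal(size=(Jn, Kn))       # cluster mean matrix X-bar_u
Jr = np.eye(Jn) - 1 / Jn                 # J x J centering matrix
Jc = np.eye(Kn) - 1 / Kn                 # K x K centering matrix

U, s, Vt = np.linalg.svd(Jr @ Xbar_u @ Jc)
C_u = U[:, :P] * s[:P]**alpha            # J C_u = U Gamma^alpha L_P
D_u = Vt[:P].T * s[:P]**(1 - alpha)      # J D_u = V Gamma^(1-alpha) L_P
```

Keeping only the first \(P\) columns of `U` and rows of `Vt` plays the role of multiplication by \(\mathbf {L}_{P}\).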

In Step 2c, \(\mathbf {G}^{(\mathrm {i})}\) is updated while regarding \(\mathbf {C}\) and \(\mathbf {D}\) as fixed. The updated \(\mathbf {G}^{(\mathrm {i})}\) is constructed by assigning each i to the cluster whose interaction approximation \(\mathbf {C}_{u}^{} \mathbf {D}_{u}^{'}\) fits it best, hence minimizing \(L_{(\mathrm {i})}\big (\mathbf {G}^{(\mathrm {i})}, \mathbf {C}, \mathbf {D}\big )\) for each individual in a greedy manner. This entails assigning observation i to cluster

$$\begin{aligned} {\arg \min }_{u=1}^U \left\| \mathbf {J} \left( \mathbf {X}_{i}^{} - \mathbf {C}_{u}^{} \mathbf {D}_{u}^{'} \right) \mathbf {J} \right\| ^2. \end{aligned}$$

A few details remain. When an empty cluster is encountered, that cluster is reinitialized using the worst-fitting observation across all clusters. Label-switching is countered by reassigning cluster labels between iterations when necessary. Finally, we prefer reporting the minimized value of \(L_{(\mathrm {i})} \big (\mathbf {G}^{(\mathrm {i})}, \mathbf {C}, \mathbf {D}\big )\) after division by \(\sum _{i = 1}^{I}\left\| \mathbf {J} \mathbf {X}_{i}^{} \mathbf {J} \right\| ^2\), since this standardized value lies in [0, 1].
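Putting the pieces together, the following is a simplified numpy sketch of Algorithm 1 (function and variable names hypothetical; the lsbclust package implementation differs in detail, and label-switching handling is omitted here):

```python
import numpy as np

def lsb_interactions(X, n_clusters, P, n_iter=100, seed=0):
    """ALS sketch for minimizing L_(i): alternate SVD updates and reassignment."""
    rng = np.random.default_rng(seed)
    I, Jn, Kn = X.shape
    Jr = np.eye(Jn) - 1 / Jn
    Jc = np.eye(Kn) - 1 / Kn
    Z = np.stack([Jr @ Xi @ Jc for Xi in X])    # double-centred slices J X_i J
    g = rng.integers(0, n_clusters, size=I)     # random start for G^(i)
    fits = np.zeros((n_clusters, Jn, Kn))       # cluster approximations C_u D_u'
    for _ in range(n_iter):
        # Step 2b: rank-P truncated SVD of each cluster mean
        for u in range(n_clusters):
            if not (g == u).any():              # empty cluster: reseed with the
                g[((Z - fits[g])**2).sum((1, 2)).argmax()] = u  # worst-fitting slice
            Uu, s, Vt = np.linalg.svd(Z[g == u].mean(0))
            fits[u] = Uu[:, :P] * s[:P] @ Vt[:P]
        # Step 2c: reassign each slice to its best-fitting cluster
        new_g = ((Z[:, None] - fits[None])**2).sum((2, 3)).argmin(1)
        if (new_g == g).all():                  # no reassignment: converged
            break
        g = new_g
    loss = ((Z - fits[g])**2).sum() / (Z**2).sum()  # standardized loss in [0, 1]
    return g, fits, loss

X = np.random.default_rng(9).normal(size=(12, 5, 4))
g, fits, loss = lsb_interactions(X, n_clusters=2, P=2)
```

In practice the whole procedure would be repeated from many random starts and the solution with the smallest standardized loss retained.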

In the next section, we briefly discuss alternative model formulations.

6 Enhancements

Here we first consider accommodating alternative bilinear model specifications in the lsbclust loss function (7). Thereafter, we discuss restricted model formulations for the interaction effects that reduce the number of parameters and aid interpretation. Finally, we briefly consider scaling options used in biplot construction. Some practical guidelines on performing model selection are provided in Sect. 8, where an application is discussed.

6.1 Using other bilinear decompositions

In certain situations, a different bilinear model may be more appropriate than the one used in the loss function (7). For example, the researcher may not want to treat the grand, row and column means separately, in which case a more appropriate loss function would simply be

$$\begin{aligned} L(\mathbf {G}^{(\mathrm {i})}, \mathbf {C}, \mathbf {D}) = \sum _{i,u} g_{iu}^{(\mathrm {i})} \left\| {\mathbf {X}_i -\mathbf {C}_{u}^{}\mathbf {D}_{u}^{'}}\right\| ^2, \end{aligned}$$

with the only constraint required being orthogonality of the columns of \(\mathbf {C}_u\) and \(\mathbf {D}_u\). Yet a situation may arise where the row means (but not the column means) are themselves interesting, leading instead to the following loss function:

$$\begin{aligned} L(\mathbf {G}^{(\mathrm {r})}, \mathbf {G}^{(\mathrm {i})}, \mathbf {A}, \mathbf {C}, \mathbf {D}) = \sum _{i,s,u} g_{is}^{(\mathrm {r})} g_{iu}^{(\mathrm {i})} \left\| {\mathbf {X}_i - \left( \varvec{a}_{s}^{}\varvec{1}_{}^{'} + \mathbf {C}_{u}^{}\mathbf {D}_{u}^{'} \mathbf {J}\right) }\right\| ^2. \end{aligned}$$

Here the centering matrix enforces sum-to-zero constraints on the columns of \(\mathbf {D}_u\), which allows \(\varvec{a}_s\) to be estimated.

In fact, a variety of different bilinear models can be specified by dropping one or more constraints, and hence cluster sets, from the formulation (7). An exhaustive list of the nine possibilities is given in Table 2, with Model 9 being the original formulation in (7). With the exception of Model 6, these models are orthogonal in the sense of Sect. 4. Hence the algorithms in Sect. 5 also apply to these bilinear models (except Model 6), after minor adjustments to account for the different centering options.

In Table 2, and in our software, we characterize the different models using a vector of four binary indicators, \(\varvec{\delta }^{'} = (\delta _1, \delta _2, \delta _3, \delta _4)\). Each of these corresponds to the presence or absence of one of the centering matrices in the model formulation. Specifically, \(\delta _1 = 1\) and \(\delta _2 = 1\) indicate centering the columns of \(\mathbf {C}_u\) and \(\mathbf {D}_u\) respectively (as in \(\mathbf {J} \mathbf {C}_{u}\) and \(\mathbf {J} \mathbf {D}_u\)), while \(\delta _3 = 1\) and \(\delta _4 = 1\) in addition imply centering the column and row means respectively—as in \(\mathbf {J} \varvec{b}_t\) and \(\mathbf {J}\varvec{a}_{s}\). Additionally, it is only possible to have \(\delta _3 = 1\) if \(\delta _1 = 1\); an equivalent relationship holds between \(\delta _4\) and \(\delta _2\).

Table 2 A summary of the models implied by different choices of \(\varvec{\delta }\) in the generalized lsbclust formulation

6.2 Common row or column coordinates

The formulation in (7) can contain a large number of parameters. This is mainly because the interaction approximations \(\mathbf {C}_{u}^{}\mathbf {D}_{u}^{'}\) allow the row and column representations, \(\mathbf {C}_u\) and \(\mathbf {D}_u\) respectively, to be different for each cluster. We can counter this by restricting either the rows or the columns to have a common representation across clusters, which has the added benefit of making the biplots based on \(\mathbf {C}_u\) and \(\mathbf {D}_u\) easier to interpret. We see this interpretability as the main reason for adopting such a constraint; in a practical application it must be weighed against the resulting reduction in goodness-of-fit.

Consequently, we allow the following three options for modelling the interactions:

  (I) \(\mathbf {C}_{u}^{}\mathbf {D}_{u}^{'}\): both \(\mathbf {C}_u\) and \(\mathbf {D}_u\) are specific to the interaction cluster (as above); or

  (II) \(\mathbf {C}_{1}^{}\mathbf {D}_{u}^{'}\): a common row representation \(\mathbf {C}_{1}\) for all interaction clusters, and a differential column representation \(\mathbf {D}_u\) for each interaction cluster; or

  (III) \(\mathbf {C}_{u}^{}\mathbf {D}_{1}^{'}\): differential row representations \(\mathbf {C}_u\) but a common column representation \(\mathbf {D}_{1}\).

If the alternative specifications (II) or (III) are used, \(\mathbf {C}_{u}^{}\mathbf {D}_{u}^{'}\) in (7) should be replaced by \(\mathbf {C}_{1}^{}\mathbf {D}_{u}^{'}\) or \(\mathbf {C}_{u}^{}\mathbf {D}_{1}^{'}\) respectively. These restricted formulations also require adjustments to Algorithm 1. To facilitate this, we define the following block matrices by stacking either row- or column-wise:

$$\begin{aligned} \mathbf {C}_{*}&= \begin{bmatrix} \sqrt{I_{1}} \, \mathbf {C}_{1}^{'} \mathbf {J}&\sqrt{I_{2}} \, \mathbf {C}_{2}^{'} \mathbf {J}&\cdots&\sqrt{I_{U}} \, \mathbf {C}_{U}^{'} \mathbf {J} \end{bmatrix}^{'}\!\!; \nonumber \\ \mathbf {D}_{*}&= \begin{bmatrix} \sqrt{I_{1}} \, \mathbf {D}_{1}^{'} \mathbf {J}&\sqrt{I_{2}} \, \mathbf {D}_{2}^{'} \mathbf {J}&\cdots&\sqrt{I_{U}} \, \mathbf {D}_{U}^{'} \mathbf {J} \end{bmatrix}^{'}\!\!; \nonumber \\ \mathbf {{\overline{X}}}_{(c)}&= \begin{bmatrix} \sqrt{I_1}\, \mathbf {{\overline{X}}}_{1}^{} \mathbf {J}&\sqrt{I_2}\, \mathbf {{\overline{X}}}_{2}^{} \mathbf {J}&\cdots&\sqrt{I_{U}}\, \mathbf {{\overline{X}}}_{U}^{} \mathbf {J} \end{bmatrix}; \nonumber \\ \mathbf {{\overline{X}}}_{(r)}&= \begin{bmatrix} \sqrt{I_1}\, \mathbf {{\overline{X}}}_{1}^{'} \mathbf {J}&\sqrt{I_2}\, \mathbf {{\overline{X}}}_{2}^{'} \mathbf {J}&\cdots&\sqrt{I_{U}}\, \mathbf {{\overline{X}}}_{U}^{'}\mathbf {J} \end{bmatrix}^{'}\!\!. \end{aligned}$$

The final term in (14) can then be rewritten in the respective formulations as

$$\begin{aligned} \left\| \mathbf {J}\left( \mathbf {{\overline{X}}}_{(c)}^{} - \mathbf {C}_{1}^{} \mathbf {D}_{*}^{'} \right) \right\| ^2&\quad \text {in case (II), or} \\ \left\| \left( \mathbf {{\overline{X}}}_{(r)}^{} - \mathbf {C}_{*}^{} \mathbf {D}_{1}^{'} \right) \mathbf {J} \right\| ^2&\quad \text {in case (III)}. \end{aligned}$$

Step 2b in Algorithm 1 is subsequently amended to perform a single SVD on either \(\mathbf {J}\mathbf {{\overline{X}}}_{(c)}^{}\) to update \(\mathbf {C}_{1}^{}\) and \(\mathbf {D}_{*}^{}\) under (II), or on \(\mathbf {{\overline{X}}}_{(r)}^{} \mathbf {J}\) to update \(\mathbf {C}_{*}^{}\) and \(\mathbf {D}_{1}^{}\) under (III). From these, updates to \(\mathbf {J} \mathbf {D}_{u}^{}\) or \(\mathbf {J} \mathbf {C}_{u}^{}\) are derived for \(u = 1, \ldots , U\) by extracting the relevant block matrices from \(\mathbf {D}_{*}^{}\) or \(\mathbf {C}_{*}^{}\) respectively.
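The amended Step 2b for case (II) can be sketched with numpy as follows. This is an illustrative sketch, not the package implementation: `update_common_rows` is a hypothetical name, and for simplicity the singular values are absorbed entirely into the column coordinates.

```python
# Sketch of the amended Step 2b under case (II): one SVD of the stacked,
# weighted cluster means J * Xbar_(c) updates the common row coordinates
# C_1 and the stacked column coordinates D_*. Illustrative numpy code only.
import numpy as np

def update_common_rows(Xbar, sizes, P):
    """Xbar: list of U cluster means (J x K); sizes: cluster sizes I_u.
    Returns C1 (J x P) and the per-cluster blocks D_u (K x P)."""
    J, K = Xbar[0].shape
    Jc = np.eye(J) - np.ones((J, J)) / J           # row-centring matrix
    Kc = np.eye(K) - np.ones((K, K)) / K           # column-centring matrix
    # Stack sqrt(I_u) * Xbar_u * J column-wise to form Xbar_(c)
    Xc = np.hstack([np.sqrt(n) * Xb @ Kc for n, Xb in zip(sizes, Xbar)])
    U_, s, Vt = np.linalg.svd(Jc @ Xc, full_matrices=False)
    C1 = U_[:, :P]                                 # common row coordinates
    Dstar = (np.diag(s[:P]) @ Vt[:P, :]).T         # stacked, weighted D_*
    # Extract the per-cluster blocks and undo the sqrt(I_u) weighting
    Ds = [Dstar[u * K:(u + 1) * K, :] / np.sqrt(sizes[u])
          for u in range(len(Xbar))]
    return C1, Ds
```

When the cluster means are exactly of rank P with a common row space, the products \(\mathbf {C}_{1}^{}\mathbf {D}_{u}^{'}\) recover the double-centred cluster means exactly.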

6.3 Biplot interpretability

Under case (I), where there is no requirement for the interaction decompositions to be similar across clusters, it can aid interpretation to rotate the configurations so that the biplot axes lie more or less in the same direction. For any orthogonal matrix \(\mathbf {Q}_{u}\), it holds for the inner product matrices that \(\mathbf {C}_{u}^{}\mathbf {D}_{u}^{'} = \left( \mathbf {C}_{u}^{} \mathbf {Q}_{u}^{} \right) \left( \mathbf {D}_{u}^{} \mathbf {Q}_{u}^{} \right) ^{'} \), and hence these are invariant to orthogonal rotations. The problem of finding orthogonal matrices \(\mathbf {Q}_{u}, u = 1, 2, \ldots , U\), such that either the row or column configurations match each other as closely as possible is known as the generalized orthogonal Procrustes problem (Gower 1975; Gower and Dijksterhuis 2004). We apply this by default in our software implementation.
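As an illustration of the rotation step, the two-configuration orthogonal Procrustes solution below rotates each cluster's configuration towards a common target (here simply the first cluster's configuration). This is a simplification of the generalized problem, with hypothetical function names.

```python
# Align per-cluster biplot configurations by orthogonal rotation: the
# Procrustes rotation Q = M N' comes from the SVD C_u' C_target = M Phi N'.
# The inner products C_u D_u' are unchanged by construction.
import numpy as np

def procrustes_rotation(C_u, C_target):
    """Orthogonal Q minimising ||C_u Q - C_target||^2 (Frobenius norm)."""
    M, _, Nt = np.linalg.svd(C_u.T @ C_target)
    return M @ Nt

def align_configurations(Cs, Ds):
    """Rotate each (C_u, D_u) pair towards Cs[0]; C_u D_u' is invariant."""
    return [(C_u @ procrustes_rotation(C_u, Cs[0]),
             D_u @ procrustes_rotation(C_u, Cs[0]))
            for C_u, D_u in zip(Cs, Ds)]
```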

Besides this adjustment, two types of scaling can be used to make the biplot displays more attractive, namely so-called \(\alpha \)- and \(\lambda \)-scaling. First, since our choice of \(\alpha \) in (16) does not change the inner product approximations, we are free to choose it such that the resulting biplots are easy to interpret. In our software implementation we use a heuristic: the value of \(\alpha \) which maximizes the minimum Euclidean distance from all row and column points to the origin. Alternatively, the user can choose any other quantile of these distances, such as the median, or specify the desired value of \(\alpha \) explicitly.
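A grid-search version of this heuristic might look as follows, given the SVD factors of a cluster mean; `choose_alpha` is a hypothetical name and the grid resolution is arbitrary.

```python
# Sketch of the alpha-scaling heuristic: pick alpha in [0, 1] so that the
# smallest distance to the origin, over all row points U Gamma^alpha and
# column points V Gamma^(1 - alpha), is as large as possible.
import numpy as np

def choose_alpha(U_rows, gamma, V_cols, grid=101):
    """U_rows, V_cols: orthonormal factors; gamma: singular values (1-D)."""
    best_alpha, best_crit = 0.0, -np.inf
    for alpha in np.linspace(0.0, 1.0, grid):
        rows = U_rows * gamma ** alpha           # row points, scaled per dim
        cols = V_cols * gamma ** (1.0 - alpha)   # column points
        crit = min(np.linalg.norm(rows, axis=1).min(),
                   np.linalg.norm(cols, axis=1).min())
        if crit > best_crit:
            best_alpha, best_crit = alpha, crit
    return best_alpha
```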

Second, note that for matrices \(\mathbf {A}\) and \(\mathbf {B}\) it holds that \(\mathbf {A} \mathbf {B}^{'} = (\lambda \mathbf {A})(\mathbf {B}^{'} / \lambda )\), so that \(\lambda \) can also be freely chosen. Following Gower et al. (2011, Section 2.3.1), we choose \(\lambda \) such that the average squared Euclidean distances from the two sets of points represented by the rows of the matrices in (16) to the origin are equal. For case (I) in (16), for example, this amounts to choosing

$$\begin{aligned} \lambda = \left( \frac{J \Vert \mathbf {V} \varvec{\Gamma }_{u}^{1-\alpha } \mathbf {L} \Vert ^{2}}{K \Vert \mathbf {U} \varvec{\Gamma }_{u}^{\alpha } \mathbf {L}\Vert ^{2}} \right) ^{1/4} = \left( \frac{J {{\,\mathrm{tr}\,}}\varvec{\Gamma }_{u}^{2(1-\alpha) } \mathbf {L} }{K {{\,\mathrm{tr}\,}}\varvec{\Gamma }_{u}^{2\alpha } \mathbf {L}} \right) ^{1/4}. \end{aligned}$$
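Numerically, given row coordinates \(\mathbf {A}\) (J points) and column coordinates \(\mathbf {B}\) (K points), this choice of \(\lambda \) can be verified directly; a small sketch with a hypothetical function name.

```python
# Lambda-scaling: choose lambda so the mean squared distance to the origin
# is equal for the J row points (rows of lambda*A) and the K column points
# (rows of B/lambda); the product A B' is unaffected.
import numpy as np

def lambda_scale(A, B):
    Jn, Kn = A.shape[0], B.shape[0]
    return (Jn * np.sum(B ** 2) / (Kn * np.sum(A ** 2))) ** 0.25
```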

The appendix in the supplemental materials provides information on goodness-of-fit indices to quantify the quality of the within-cluster approximations.

Next, we report the results of a simulation study.

7 Simulation study

A simulation study was conducted to assess cluster membership recovery of lsbclust. Since the separate clustering problems for the overall mean, row means and column means are simply k-means problems, we focus on reporting the results for the row–column interactions. Simulation studies for ordinary k-means can be found, for example, in Milligan (1980).

The results of lsbclust are compared to those of T3clus, as well as to two variants of k-means clustering. The first variant, which we call vectorized k-means or vecKmeans, simply vectorizes the matrix slices by stringing them out as long vectors, and then applies ordinary k-means to the matrix having these vectors as rows. The second variant first obtains the best P-dimensional approximation to each matrix slice using the SVD, and then applies vecKmeans; we call this variant dimKmeans. Since both methods are compared to the lsbclust interaction clustering, the matrix slices \(\mathbf {X}_i\) are double-centred before being submitted to these procedures.
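The two baselines can be sketched as follows. A minimal Lloyd's-algorithm implementation with deterministic farthest-point initialisation stands in for the k-means routine actually used in the study, and the slices are assumed to be double-centred already.

```python
# vecKmeans: k-means on the vectorized slices. dimKmeans: the same after
# replacing each slice by its best rank-P approximation via the SVD.
# Illustrative numpy code; not the implementation used in the study.
import numpy as np

def kmeans(X, k, iters=100):
    centres = [X[0]]                               # farthest-point init
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centres], axis=0)
        centres.append(X[d.argmax()])
    centres = np.array(centres)
    for _ in range(iters):                         # Lloyd's iterations
        d = ((X[:, None, :] - centres[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(0)
    return labels

def vec_kmeans(slices, k):
    return kmeans(np.stack([S.ravel() for S in slices]), k)

def dim_kmeans(slices, k, P):
    approx = []
    for S in slices:
        U_, s, Vt = np.linalg.svd(S, full_matrices=False)
        approx.append(U_[:, :P] @ np.diag(s[:P]) @ Vt[:P])
    return vec_kmeans(approx, k)
```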

Simulated data from the lsbclust model are constructed according to the following steps:

  1. Simulate the cluster membership matrix \(\mathbf {G}^{(\mathrm {i})}\) given the required number of clusters U and the number of observations I. The proportion of observations attributed to each cluster, \(\pi _{u}, u = 1, \ldots , U\), must be specified. Similarly, generate \(\mathbf {G}^{(\mathrm {o})}\), \(\mathbf {G}^{(\mathrm {r})}\) and \(\mathbf {G}^{(\mathrm {c})}\) with R, S and T clusters respectively. The same cluster membership probabilities are used for these clusters.

  2. Simulate overall means \(m_r, r = 1, \ldots , R\), from a standard normal (Gaussian) distribution, as well as row means \(\varvec{a}_s, s = 1, \ldots , S\), and column means \(\varvec{b}_t, t = 1, \ldots , T\). These are also drawn from standard normal distributions, and subsequently centred.

  3. Simulate matrices \(\overline{\mathbf {X}}_u, u = 1, \ldots , U,\) representing the cluster means in (14), as follows:

     (a) Generate two random orthogonal matrices using Stewart (1980)’s method, representing the rows and columns in P-dimensional space. Column-centre both these matrices to arrive at \(\mathbf {U}_{u}\) and \(\mathbf {V}_{u}\), say. If case (a) or (b) applies, the same \(\mathbf {U}_{u} = \mathbf {U}\) or \(\mathbf {V}_{u} = \mathbf {V}\) respectively is used for all clusters.

     (b) Generate P singular values as random numbers on [0.5, 5], and sort these in decreasing order in the vector \(\varvec{\gamma }_u\). If case (a) or (b) applies, the same singular values \(\varvec{\gamma }_u = \varvec{\gamma }\) are used for all clusters.

     (c) Construct \(\varvec{\Gamma }_u = {{\,\mathrm{diag}\,}}\left( \varvec{\gamma }_u\right) \) and consequently \(\overline{\mathbf {X}}_u^{} = \mathbf {U}_{u}^{}\varvec{\Gamma }_u^{}\mathbf {V}_{u}^{'}\).

  4. Finally, construct the simulated \(\underline{\mathbf {X}}\) by using the relevant simulated cluster means of each type for each matrix slice, and by adding random noise simulated from a normal (Gaussian) distribution with zero mean and standard deviation \(\sigma \). Here \(\sigma \) controls the signal-to-noise ratio: larger \(\sigma \) makes it harder to estimate the parameters used in the simulation steps.
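The interaction-specific part of these steps (steps 1, 3 and 4 for \(\mathbf {G}^{(\mathrm {i})}\) and \(\overline{\mathbf {X}}_u\)) can be condensed into a short sketch. We use the QR decomposition of a Gaussian matrix as a stand-in for Stewart (1980)'s method of generating random orthogonal matrices; function and argument names are hypothetical.

```python
# Simulate memberships, column-centred orthonormal factors, decreasing
# singular values on [0.5, 5], the cluster means Xbar_u, and Gaussian noise.
import numpy as np

def simulate_interactions(I, J, K, U, P, pi, sigma, seed=0):
    rng = np.random.default_rng(seed)
    g = rng.choice(U, size=I, p=pi)                  # cluster memberships
    Jc = np.eye(J) - np.ones((J, J)) / J
    Kc = np.eye(K) - np.ones((K, K)) / K
    Xbar = []
    for _ in range(U):
        Qr, _ = np.linalg.qr(rng.standard_normal((J, P)))
        Qc, _ = np.linalg.qr(rng.standard_normal((K, P)))
        gam = np.sort(rng.uniform(0.5, 5.0, P))[::-1]  # decreasing values
        Xbar.append((Jc @ Qr) @ np.diag(gam) @ (Kc @ Qc).T)
    X = np.stack([Xbar[u] for u in g])               # one slice per subject
    return X + sigma * rng.standard_normal(X.shape), g, Xbar
```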

In our simulation study, the following factors were varied:

  • The number of observations, \(I \in \{100, 500\}\);

  • The interaction decomposition: (a) \(\mathbf {C}_{1}^{} \mathbf {D}_{u}^{'}\) (rows fixed across clusters), (b) \(\mathbf {C}_{u}^{} \mathbf {D}_{1}^{'}\) (columns fixed) or (c) \(\mathbf {C}_{u}^{} \mathbf {D}_{u}^{'}\) (neither fixed);

  • The error standard deviation, \(\sigma \in \{0.5, 1, 1.5\}\);

  • The cluster proportions, \(\varvec{\pi }_{u}\), being either balanced (\(\varvec{\pi }_{u}^{'} = (0.2, 0.2, 0.2, 0.2, 0.2)\)) or unbalanced (\(\varvec{\pi }_{u}^{'} = (0.1, 0.15, 0.2, 0.25, 0.3)\)); and

  • The dimensionality of the interaction decomposition, \(P \in \{2, 5\}\).

This translates to a \(3^2 \times 2^3\) design with 72 conditions. For simplicity, the following were kept fixed: the model simulated from, namely Model 9—see Table 2; the size of the matrix slices, \(J = K = 8\); and the number of clusters, \(R = S = T = U = 5\).

For each of the 72 conditions, we generate 100 parameter sets. In turn, for each of these parameter sets, we generate 50 randomly sampled data sets, resulting in 5000 simulated data sets per condition, or 360,000 in total. The results can be assessed both on clustering quality and estimation accuracy, where the latter includes clustering quality as well as parameter recovery. Here we discuss clustering quality only.

7.1 Clustering quality

Cluster recovery is measured by the adjusted Rand index (ARI; Hubert and Arabie 1985), which in our case quantifies the similarity between the true, simulated clustering and the clustering recovered by an algorithm. It improves on simple cluster agreement by adjusting for the chance of a randomly chosen pair of subjects falling in the same class. The ARI takes a value of one when cluster recovery is perfect, and zero when the algorithm performs similarly to random assignment. The ARI can also take negative values, which indicate worse performance than random assignment. We first report performance profiles of the ARI, which assess how well each method performs relative to all the other methods. Thereafter, we consider the absolute performance of the methods.
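For reference, the ARI can be computed from the pairwise contingency counts in a few lines; this compact implementation is equivalent in intent to standard routines such as sklearn.metrics.adjusted_rand_score.

```python
# Adjusted Rand index from the contingency counts of two labelings.
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    nij = Counter(zip(a, b))                    # joint cluster counts
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)  # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                   # degenerate labelings
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```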

7.1.1 Performance profiles

Performance profiles are used to compare multiple algorithms on a chosen performance metric (Dolan and Moré 2002; van den Burg and Groenen 2016). The basic idea is to express the performance of each algorithm relative to that of the best performing algorithm on each particular data set, and then to plot the cumulative distribution of this relative performance. Denote by \({\mathcal {D}}\) the set of data sets and by \({\mathcal {C}}\) the set of algorithms. Suppose that \(p_{d,c}\) is the ARI for \(d \in {\mathcal {D}}\) and \(c \in {\mathcal {C}}\). The performance ratio \(r_{d,c}\) is then defined as the ratio of the best performing ARI on data set d to \(p_{d,c}\):

$$\begin{aligned} r_{d,c} = \frac{\max _{c' \in {\mathcal {C}}} p_{d,c'}}{p_{d,c}}. \end{aligned}$$

Typically, the best performing method has a performance ratio of one, with other methods having larger performance ratios, indicating how close these methods came to the best method. For the ARI there is one caveat, however: the ARI may be negative or even exactly zero, in which case (21) does not work as intended. We circumvent this problem by adding a small positive constant to both the numerator and denominator in (21).

Fig. 1 Performance profiles based on the adjusted Rand index, for \(P = 2\) dimensions and a sample size of \(I = 100\)

Fig. 2 Performance profiles based on the adjusted Rand index, for \(P = 2\) dimensions and a sample size of \(I = 500\)

The performance profile for algorithm c is simply the empirical cumulative distribution function (ecdf) of the performance ratios. This can be calculated as

$$\begin{aligned} P_c (\nu ) = \frac{|\{d \in {\mathcal {D}}: r_{d,c} \le \nu \}|}{|{\mathcal {D}}|} \end{aligned}$$

where \(|\cdot |\) denotes the cardinality of a set. Therefore, \(P_c(\nu )\) is simply the empirical probability of algorithm c having a performance ratio of at most \(\nu \), and by extension \(P_c(1)\) is the empirical probability that algorithm c achieves the best performance.
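Putting the two formulas together, with the small positive constant guarding against zero or negative ARIs, a sketch might be (function names hypothetical):

```python
# Performance ratios r_{d,c}: the best ARI over methods divided by each
# method's ARI, with epsilon added to numerator and denominator, followed
# by the empirical cdf P_c(nu) of those ratios.
def performance_ratios(ari, eps=1e-8):
    """ari: dict mapping method name -> list of ARIs over data sets."""
    n = len(next(iter(ari.values())))
    best = [max(ari[c][d] for c in ari) for d in range(n)]
    return {c: [(best[d] + eps) / (ari[c][d] + eps) for d in range(n)]
            for c in ari}

def profile(ratios, nu):
    """P_c(nu): fraction of data sets with performance ratio at most nu."""
    return sum(r <= nu for r in ratios) / len(ratios)
```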

We calculate performance profiles for each combination of the five factors manipulated in the simulation study. The results are shown in Figs. 1 and 2. Both of these figures pertain to \(P = 2\) dimensions, with Figs. 1 and 2 corresponding to data sets with \(I = 100\) and \(I = 500\) observations respectively. For the sake of brevity, we omit the figures for \(P = 5\) dimensions: in these cases, all algorithms achieve close to optimal results.

Table 3 The mean adjusted Rand indices for the different methods, averaged over data and parameter sets

The results in Figs. 1 and 2 show that lsbclust generally outperforms the other methods, since its performance profiles generally lie above those of the other methods. The next best method is vecKmeans, with T3clus coming in last. It is to be expected that lsbclust performs well, since it was used to generate the data. vecKmeans is quite competitive when interaction decomposition (c) is used (where neither component matrix is fixed across clusters). When the actual model does include restrictions on the interaction decomposition, lsbclust performs much better than the other methods.

In terms of the factors manipulated in the study, the error standard deviation (\(\sigma \)), the sample size I and the interaction decomposition are the most important. Whether the clusters are balanced or not has very little bearing on the results.

Having assessed the relative performance of the methods, we turn our attention to the absolute ARI achieved on the simulated data.

7.1.2 Adjusted Rand index

Table 3 reports the average ARI for the different methods. We first calculate the average ARI over the 50 data sets generated for each parameter set, and then average the resulting 100 average ARIs over the different parameter sets. The table only includes results for \(P = 2\) dimensions, and for the case where the cluster sizes are balanced. The latter factor does not affect the results significantly, so the unbalanced results have been omitted. The former does have an important but uninteresting effect: when \(P = 5\), all methods achieve near-optimal ARIs. Overall, an increase in the error standard deviation degrades performance the most. With enough samples, however, lsbclust can still achieve decent clustering performance when \(\sigma \) is high.

In the next section, we consider an illustrative empirical example.

8 Application

The data comes from a brand positioning study where 187 consumers evaluated 10 car manufacturers on a set of 8 attributes (Bijmolt and van de Velden 2012). It was collected via an online survey using a representative Dutch sample from the CentERpanel of Tilburg University in the Netherlands, with the aim of studying how consumers perceive different car brands. The data \(\underline{\mathbf {X}}\) is therefore of size \(187 \times 10 \times 8\).

The 10 international car brands considered were: Citroën, Fiat, Ford, Opel, Peugeot, Renault, Seat, Toyota, Volkswagen and Volvo. Respondents rated each of these brands on 8 different attributes using a 10-point rating scale. For 6 out of the 8 items, namely Affordability, Attractiveness, Safety, Sportiness, Reliability and Features, a score of 10 is the most desirable outcome. However, for the items Size and Operating Cost, a score of 10 reflects small cars and those with high operating costs, respectively. Hence, higher ratings on these two attributes reflect increasingly negative sentiment, in contrast to the remaining six items.

We fit an lsbclust model with \(\varvec{\delta } = (1, 1, 1, 1)\)—Model 9 in Table 2—so that the overall means, row means, column means and interactions are estimated separately. Also, we select \(P = 2\) dimensions for display purposes, and we fix the coordinates of the 10 car brands across all interaction biplots—using case (II) from Sect. 6.2—to aid with interpretation.

Besides these choices, the number of clusters for each of the four components must be determined. For simplicity, we opt to do this separately for each of the four sets of clusters, although it is possible to make a joint selection. Selecting the number of clusters is a common problem, and many procedures and criteria have been proposed in the literature. Milligan and Cooper (1985), Hardy (1996) and Everitt et al. (2011) provide assessments of some of these criteria and additional references. The simplest approach, and the one we use here for illustration, is probably the scree test (Cattell 1966). This method involves running the algorithm for several values of k and plotting the loss function against k. The user then chooses a value of k close to an “elbow” in this so-called scree plot, beyond which adding further clusters does not substantially improve how well the results describe the data.
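As a sketch of this workflow, a simple rule for locating the elbow is the largest second difference of the loss sequence; the `elbow` helper below is a hypothetical stand-in for visually inspecting the scree plot after fitting lsbclust for each k.

```python
# Scree-test sketch: given losses for k = 1, 2, ..., return the k at which
# the loss curve bends most sharply (maximum second difference).
def elbow(losses):
    best_k, best_d2 = None, float("-inf")
    for k in range(1, len(losses) - 1):
        d2 = losses[k - 1] - 2 * losses[k] + losses[k + 1]
        if d2 > best_d2:
            best_k, best_d2 = k + 1, d2    # losses[k] corresponds to k + 1
    return best_k
```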

Here we fit lsbclust models for 1 to 15 clusters and inspect the resulting scree plots to select R, S, T and U. Based on these plots, we selected \(R = 5\), \(S = 5\), \(T = 6\) and \(U = 8\) clusters. The number of row clusters was reduced to \(S = 5\) from an initial choice of \(S = 8\) to avoid clusters containing only a single observation. We note that these choices are subjective, should take into account the aims of the research, and that alternative selection criteria can also be used. The number of random starts used for the interaction and k-means clusterings were 100 and 1000 respectively.

The cluster sizes and centres (i.e., \(m_r\) in Eq. (7)) for each of the five overall mean clusters are shown in Table 4. Noticeable here is that the small clusters O4 (8 observations) and O5 (3 observations) identify persons who invariably used very high and very low scores, respectively. These respondents obviously do not provide very interesting information in their answers. But since their corresponding row means, column means and interactions do not differ from the overall mean, they are merely assigned to the row, column and interaction segments containing negligible effects (see below). lsbclust has therefore been able to identify the 11 persons in clusters O4 and O5 who provide very little sensible information.

Figure 3 displays the means of the five car brand (row) clusters across all attributes. Effect sizes can be read off on the horizontal axis. The first two clusters, Segments R1 (98 observations) and R2 (59 observations), contain no pronounced effects, which indicates that these consumers do not have strong, consistent opinions on any of the brands across all attributes. The 23 observations in Segment R3 do assign lower scores to Volvo and somewhat higher scores to Peugeot across all items. These effects amount to approximately \(-1.4\) and 0.7 for Volvo and Peugeot, respectively. Even larger effects are observed in the smaller Segments R4 (4 observations) and R5 (3 observations), but these comprise a relatively small part of the data.

Table 4 Segments based on overall mean ratings as detected in the cars data
Fig. 3 Car manufacturer (row) cluster means detected in the cars data. The size of the effects can be read off from the horizontal axis

Fig. 4 Attribute (column) cluster means detected in the cars data

Fig. 5 Biplots for the interaction clusters I1 to I4 detected in the cars data. The relative cluster sizes are displayed in the titles of the different panels. Each attribute is represented by a vector, and those which fit best in each panel also by calibrated axes. The car manufacturers are represented by points, which have identical locations across panels. The orthogonal projections of the car manufacturer points onto the attribute axes give the estimated mean effects. The colors and labels are faded according to how well they fit into the display: solid colors fit well and transparent ones fit badly

Fig. 6 Biplots for the interaction clusters I5 to I8 detected in the cars data

The attribute (column) mean effects for all six clusters are displayed in Fig. 4. Here all segments contain at least 13 observations. Again, the largest cluster, Segment C1 (69 observations), contains no large effects. Segments C2 (35 observations), C3 (33 observations), and C4 (22 observations) are similar in that they display an inclination to assign higher scores on Reliability and Safety, and lower scores on Affordability, irrespective of the car brand being assessed. The magnitudes of these effects vary greatly over these clusters though, and Segments C2 and C4 also display negative effects for Size. Segment C5 (15 observations) is dominated by a tendency to assign low scores on Affordability. Several large effects are seen in Segment C6 (13 observations). These indicate a generally positive assessment of all brands on Size, Operating Cost and Reliability, bearing in mind that for the former two attributes lower scores are better.

The most interesting results can be found among the interactions, which is where respondents distinguish between different car manufacturers on the measured attributes. Figures 5 and 6 show the biplots for the eight interaction segments. The car manufacturers are represented by points, and the attributes by arrows. The labels, points and arrows are shaded according to their goodness-of-fit, with well-fitting points being darker. All car brands, except Ford with a fit of only 0.09, fit reasonably well—see Table 5. The locations of the car brands are fixed across all biplots to make them easier to interpret. It is immediately apparent that the French manufacturers (Peugeot, Citroën and Renault) are judged to be similar, while the German brands Opel and Volkswagen are also located close together. Volvo, the Swedish car manufacturer, is somewhat isolated towards the right of the biplots, in contrast with Fiat at the opposite side of the plot. Fiat and Toyota are judged to be somewhat similar to the French and German brands respectively. Seat in turn is most similar to Toyota. As a result of its low fit, Ford is hardly visible and lies near the origin.

The fit for the eight attributes varies per segment, and is summarized in Table 6. Typically only a subset of items fit well in each segment, and only the better-fitting ones are adorned with calibrated axes in Figs. 5 and 6. For any manufacturer, the estimated within-cluster interaction effects can be read off from the orthogonal projection of its representing point onto the respective biplot axes. For example, Volvo scores approximately 0.25 points above that predicted by the overall mean, row mean and column mean on Sportiness in Segment I1, and Fiat scores about 2 rating points above the overall and marginal effects on Affordability in Segment I6. The overall variance accounted for in approximating the cluster means is 82.8% in two dimensions, with 64.1% and 18.7% attributed to dimensions 1 and 2, respectively. Hence two dimensions are a reasonable choice.

The effects in the interaction clusters should be interpreted as deviations from that expected from the overall and marginal means alone. Here are summaries of the interaction segments:

  • Segment I1 (32.6%) has no large effects. Hence for this sizable group of individuals, the overall and marginal means contain most of the information provided.

  • Segment I2 (12.3%) interprets Fiat and Seat to be more affordable than expected from the overall and marginal effects alone. The opposite applies to Volvo and Volkswagen. In terms of safety and reliability, however, the roles are reversed: Fiat and Seat score lower than expected, while Volvo and Volkswagen excel on these items. The effect sizes are roughly the same on the aforementioned three items.

  • Segment I3 (12.3%) contrasts positive interaction effects on Affordability and Size with negative effects on Safety, Reliability, Attractiveness and Features, and vice versa. Taking into account that Size is reverse-coded, this segment considers Fiat and Seat to be more affordable but smaller cars. At the same time, these are considered less reliable, attractive and safe than especially Volvo and Volkswagen. This segment is not dissimilar to Segment I2, except that the effect sizes are larger and that Attractiveness and Features are now included on the right side of the plot.

  • Segment I4 (10.7%) contains smaller effects than Segment I3. It distinguishes somewhat between Affordability and Size, in contrast to Segment I3. Whereas Fiat still performs better than expected in terms of Affordability, Seat is perceived to perform worse than Fiat on Size (after adjusting for the overall and marginal means). There is also now a much bigger difference between Volvo and Volkswagen in terms of size, with the latter scoring worse than expected compared to Renault, Peugeot and Citroën.

  • Segment I5 (10.2%) again interprets the left of the plot as better in terms of Affordability and worse in terms of Size. But it also associates this with lower operating costs (since the latter is reverse-coded). Volvo and Volkswagen are scored higher than expected in terms of safety, in contrast to Seat and Fiat.

  • Segment I6 (8.6%) contains large interaction effects on three pairs of items. Renault, Citroën and Peugeot score higher on Attractiveness and Features than expected. This is in contrast to Seat and Volkswagen, which exhibit negative interaction effects on these items. Safety and Reliability again favour Volkswagen and Volvo, while Affordability and Size have similar but larger effects than in Segment I4.

  • Segment I7 (7.5%) has similar interaction effects for Affordability and Size as Segments I4 and I5. But these individuals also consider Volkswagen, Volvo and Opel to have higher than expected operating costs, in contrast to Fiat, Renault and Citroën. The latter brands, together with Peugeot, are also considered the most attractive, with Volkswagen being considered less attractive than expected using only the overall and main effects.

  • Segment I8 (5.9%) contains individuals who consider Renault, Fiat, Citroën and Peugeot to be more affordable than expected using only the overall and main effects. At the same time, these brands are considered to have higher expected operating costs and smaller cars. The directions of Attractiveness, Safety and Reliability indicate that the combination of these aspects is considered to correlate with higher prices.

Table 5 Brand fit for the cars data across all clusters. Higher values indicate better fit, with a maximum of one and minimum of zero
Table 6 Attribute fit for the cars data, for all eight interaction segments

Clearly, there are strong similarities but also interesting differences in how these groups of individuals interpreted the performance of the brands on the respective items after adjusting for overall and main effects. In the context of this application, these insights can provide valuable input to a brand positioning strategy, for example. Code reproducing our analysis of these data appears in the appendix of the supplemental materials.

9 Conclusions

This paper introduces lsbclust, a modelling framework for three-way data, in which one of the three ways is clustered over whilst the corresponding matrix slices are approximated by low-rank decompositions. The clustering is done simultaneously with respect to up to four different aspects of these matrix slices, namely the overall mean responses, the row means, the column means, and the row–column interactions. These are the four elements of the biadditive (or bilinear) model used to approximate each of the matrix slices. Which of these terms are included in the model depends on the choice of identifiability constraints, as parametrized by \(\varvec{\delta }\). We show that for eight of the nine unique choices of \(\varvec{\delta }\), the combination of the bilinear model and least-squares loss allows the four clustering problems to be treated independently. This important property greatly reduces the complexity of the clustering problem, which also has positive implications for model selection and the interpretation of the results. The low-rank decompositions of the interaction cluster means lead to biplots which aid in the interpretation of the results.

As illustrated in an application, lsbclust is a useful and natural alternative to more traditional three-way matrix decomposition methods such as parafac/candecomp and tuckals3. Our method uses a combination of well-known multivariate statistical methods, namely k-means cluster analysis, low-rank decompositions of two-way matrices as well as biplots, whereas traditional three-way methods require domain-specific expertise. Since least-squares loss functions are used, the problems can be treated very efficiently in software. Such software implementing lsbclust has been developed in the form of an eponymous R (R Core Team 2020) package. The package, lsbclust (Schoonees 2019), is available for download from the Comprehensive R Archive Network (CRAN, http://cran.r-project.org).

There are some points that require further research. The treatment of missing values has not been discussed, and should be investigated in the future. In terms of model selection, a wide variety of alternatives to the scree test can and should be investigated. There are a number of promising methods available in the literature, including the use of multiple criteria combined with a vote to determine the most attractive choice. We note that the rank of the low-rank decomposition can also be considered as a model selection step. Furthermore, it would be possible to add case weights to the methodology. An advantage of case weights is that they provide a mechanism for implementing the bootstrap (e.g. Efron and Tibshirani 1994) to assess the variability of any given solution.