1 Introduction

In many domains of research, it frequently occurs that the available information pertains to a number of objects on which a set of variables is measured by different subjects (or on different occasions). This information can be stored in a three-way data array or tensor. The three sets of entities (objects, variables, and subjects) associated with such a three-way data array are usually referred to as modes. A three-way array can be seen as a collection of standard two-way, two-mode (objects by variables) matrices, one for each subject. It is obvious that three-way data offer richer information but, at the same time, are more complex than two-way data. It is therefore fundamental to develop new statistical tools able to exploit the potentialities of three-way data. Of course, classical statistical methods could be used to analyze three-way data. For instance, it would be sufficient to juxtapose the matrices pertaining to the various subjects next to each other to obtain a matrix with rows corresponding to objects and columns corresponding to all the combinations of variables and subjects. This matrix can be analyzed by classical methods, but the conclusions are usually inadequate because the three-way structure is not properly taken into account, i.e., such methods simply ignore that the variables are replicated across subjects.

A traditional goal of a three-way analysis involves dimension reduction in order to summarize the observed data through a limited number of components. For this purpose, two popular choices are represented by the Tucker3 (Tucker, 1966) and Candecomp/Parafac (Carroll & Chang, 1970; Harshman, 1970) models. The former assumes different numbers of components for objects, variables, and subjects, and their triple interactions are evaluated in a reduced three-way array called core. The latter is a simpler model, assuming a common number of components and a constrained structure of the core. This implies that, differently from Tucker3, the Candecomp/Parafac solution is essentially unique (Kruskal, 1977).

Although the two models above are widely applied for extracting components from three-way data, they do not allow clusters to be discovered. In this respect, Tucker3 and Candecomp/Parafac have been generalized to the clustering framework. We can roughly distinguish two main classes of techniques, aiming at partitioning the entities of a single mode or of more modes simultaneously. The case where two or three modes are clustered simultaneously is commonly referred to as co-clustering (see Madeira & Oliveira, 2004, for a comprehensive review of the two-way case, and Papalexakis et al., 2013, for a proposal in the three-way case). In this paper, we focus on clustering a single mode, seeking, without loss of generality, a partition of the objects, which is by far the most common case.

Starting from Krijnen (1993) and Krijnen and Kiers (1993), several clustering models for three-way data can be found in the literature following either a model-based (see, e.g., Viroli, 2011; Gallaugher & McNicholas, 2018, 2020; Melnykov & Zhu, 2018; Silva et al., 2019; Sarkar et al., 2020; Tomarchio et al., 2022) or a least-squares approach (see, e.g., Rocci & Vichi, 2005; Vichi et al., 2007; Wilderjans & Ceulemans, 2013; Wilderjans & Cariou, 2016; Cariou & Wilderjans, 2018; Cariou et al., 2021; Schoonees et al., 2022). Here, the least-squares approach will be followed throughout the paper.

Within such an approach, Wilderjans and Ceulemans (2013) clearly observe that two different strategies can be adopted. The first consists of models focusing on the between-cluster differences in covariation. In these models, the components extracted within each cluster are different from those extracted within the remaining clusters. Therefore, the assumption is that objects assigned to the same cluster share the same underlying component structure, whereas objects belonging to different clusters have different underlying components. The second strategy concerns models aiming at discovering between-cluster differences in mean, i.e., models where the differences in the mean profiles of the different clusters are explained by components. This can be done by summarizing only variables and subjects through components, whilst objects are partitioned into a reduced number of clusters. It follows that objects are treated asymmetrically with respect to variables and subjects. Specifically, objects are assigned to clusters following a K-means-type (MacQueen, 1967) procedure where the centroids lie in the low-dimensional space spanned by the components for the variables and the subjects. The partitioning of objects and the dimension reduction of both variables and subjects are performed simultaneously. We refer to Wilderjans and Ceulemans (2013) for further details on the between-cluster differences in covariation and in mean.

Examples of models focusing on the between-cluster differences in covariation include Wilderjans and Ceulemans (2013) and Wilderjans and Cariou (2016), among others. Wilderjans and Ceulemans (2013) propose a clusterwise version of the Candecomp/Parafac model, where objects are classified into a limited number of clusters and, simultaneously, the objects assigned to each cluster are modeled through the Candecomp/Parafac model. Wilderjans and Cariou (2016) develop a model, further extended by Cariou and Wilderjans (2018) and Cariou et al. (2021), that boils down to simultaneously grouping the objects into clusters and applying Candecomp/Parafac with one component for each cluster. In Cariou and Wilderjans (2018), non-negativity constraints are introduced in the estimation process.

Rocci and Vichi (2005) and Vichi et al. (2007) propose the most prominent models aiming at discovering between-cluster differences in mean. In these proposals, suitable generalizations of the Tucker3 model are developed by introducing a binary object partition matrix in place of the standard component matrix for the objects. Rocci and Vichi (2005) and Vichi et al. (2007) can also be seen as three-way generalizations of, respectively, the reduced K-means (De Soete & Carroll, 1994) and Factorial K-means (Vichi & Kiers, 2001) methods for standard two-way data, where clusters of objects and components for variables are simultaneously determined.

In this paper, we propose a new clustering model for three-way data where objects are assigned to clusters and, simultaneously, variables and subjects are reduced consistently with the latter strategy described above. For ease of interpretation and for the uniqueness property, differently from Rocci and Vichi (2005) and Vichi et al. (2007), which represent the starting point of our work, we consider the Candecomp/Parafac model with one component for each cluster. In this respect, the proposed model is also closely related to Wilderjans and Cariou (2016). However, it is important to stress that the objectives of the two models are different. Namely, our model analyzes between-cluster differences in mean because it adopts a K-means strategy where the Candecomp/Parafac-based model part for variables and subjects plays the same role as the centroids do in K-means, whilst Wilderjans and Cariou (2016) analyze between-cluster differences in covariation. As we shall see in detail through some clarifying examples, this implies that only our model is a partitioning method in a strict sense (differences in mean), because Wilderjans and Cariou (2016) identify groups of objects with the same underlying structure (differences in covariation), which does not necessarily lead to standard clusters distinct in mean.

The paper is organized as follows: In the next section, the novel clustering model is proposed and formalized in a general framework. Some issues related to special cases and model assessment are discussed, together with a theoretical and practical comparison with the existing techniques. The estimation algorithm is presented in Sect. 3. Section 4 contains the results of an extended simulation study carried out in order to evaluate the performance of the proposed model and algorithm, also in comparison with its potential and direct competitors. Some applications to real data are given in Sect. 5. Finally, the last section is devoted to some concluding remarks.

2 The Clustering Model

In this section, we introduce a novel clustering model for three-way data exploiting the potentialities of both the K-means clustering algorithm and the Candecomp/Parafac model. For this purpose, we start from a general K-means-based clustering model for three-way data and, later, we recall the Tucker3 and Candecomp/Parafac models together with T3clus, the model proposed by Rocci and Vichi (2005).

2.1 A General K-Means Based Clustering Model

Let us suppose J variables are measured on N objects by H subjects. Such data are stored in the three-way array X of order (N × J × H), whose generic element is xnjh, expressing the measurement of object n (n = 1, …, N) with respect to variable j (j = 1, …, J) made by subject h (h = 1, …, H). Array X can be seen as a collection of matrices, one for every subject. Therefore, matrix Xh (h = 1, …, H) of size (N × J) is usually referred to as slice and contains all measurements from subject h.

The most general clustering model can be fully specified as follows:

$$\mathbf{X}_{h} = \mathbf{U}_{h}\mathbf{Y}_{h} + \mathbf{E}_{h},$$
(1)

h = 1, …, H, where Eh is the error term for subject h and Uh is the membership matrix of order (N × K) assigning objects to clusters, with K denoting the number of clusters. Matrix Uh is binary with exactly one entry equal to 1 per row and identifies a partition of the N objects into K disjoint clusters for subject h (h = 1, …, H). Matrix Yh (h = 1, …, H) of order (K × J) is the so-called subject-specific centroid matrix. The generic k-th row of Yh refers to the centroid of cluster k expressed in terms of the J observed variables. Thus, the model assumes a different partition across the slices related to the subjects. In other words, separate partitions are independently sought by a K-means-type model for every subject.

In order to exploit the three-way structure of the data, i.e., to properly take into account that the same variables are observed on the same objects by the subjects, constrained versions of model (1) can be derived. For instance, we may assume that Uh = U, h = 1, …, H. Then, we get

$$\mathbf{X}_{h} =\mathbf{UY}_{h} +\mathbf{E}_{h},$$
(2)

h = 1, …, H. Matrix U is a membership matrix, subject to the same constraints as Uh. Therefore, model (2) identifies a common partition across subjects. As in model (1), different centroid matrices Yh are assumed, allowing for possible differences among subjects. Therefore, model (2) is a K-means-type model with a consensus partition specified by U.

Let us generally denote by Aa = [A1|A2| …|AH] the supermatrix where the matrices Ah (h = 1, …, H) are collected next to each other. When applied to array X, this yields the supermatrix Xa of order (N × JH), which contains all the object-by-variable matrices pertaining to the subjects next to each other. With obvious notation, model (2) can be equivalently rewritten in terms of Xa as

$$\mathbf{X}_{\mathrm a} =\mathbf{UY}_{\mathrm a} +\mathbf{E}_{\mathrm a}.$$
(3)

In addition, model (3) can be extended by synthesizing the centroids through a limited number of components for the variables and the subjects. In order to describe such extensions, the Tucker3 and Candecomp/Parafac models need to be recalled.
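To make the notation concrete, the following sketch (a minimal Python/NumPy illustration written for this exposition; the array X and its dimensions are hypothetical) shows how the supermatrix Xa of order (N × JH), with variables nested within subjects, can be assembled from the slices Xh.

```python
import numpy as np

# Hypothetical dimensions: N objects, J variables, H subjects
N, J, H = 6, 4, 3
rng = np.random.default_rng(0)

# Three-way array X stored as an (N x J x H) NumPy array
X = rng.normal(size=(N, J, H))

# Supermatrix X_a of order (N x JH): the H object-by-variable slices
# X[:, :, h] are placed next to each other (variables nested within subjects)
X_a = np.concatenate([X[:, :, h] for h in range(H)], axis=1)
print(X_a.shape)  # (6, 12)
```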

2.2 The Tucker3, T3clus, and Candecomp/Parafac Models

The Tucker3 (T3) model, proposed by Tucker (1966), can be expressed as

$$\mathbf{X}_{\mathrm a}=\mathbf{AH}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a},$$
(4)

where A of order (N × P), B of order (J × Q), and C of order (H × R) are the component matrices for the objects, the variables, and the subjects, respectively, with P (≤ N) being the number of object components, Q (≤ J) the number of variable components, and R (≤ H) the number of subject components. Without loss of fit, A, B, and C can be constrained to be columnwise orthonormal. Moreover, Ha is the supermatrix obtained from the so-called core array H of order (P × Q × R), expressing the interactions among the object, variable, and subject components. Different numbers of components in A, B, and C can be assumed to deal with different levels of complexity for objects, variables, and subjects. Finally, the symbol ⊗ denotes the Kronecker product.
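As an illustration of the structural part of (4), the following sketch (hypothetical dimensions and random matrices, written for this exposition) reconstructs the model part of Xa from A, the core supermatrix Ha, and the Kronecker product C ⊗ B.

```python
import numpy as np

N, J, H = 6, 4, 3       # objects, variables, subjects (hypothetical)
P, Q, R = 2, 3, 2       # numbers of components for the three modes

rng = np.random.default_rng(1)
A = rng.normal(size=(N, P))          # object component matrix
B = rng.normal(size=(J, Q))          # variable component matrix
C = rng.normal(size=(H, R))          # subject component matrix
core = rng.normal(size=(P, Q, R))    # core array H

# Core supermatrix H_a of order (P x QR), with variable components
# nested within subject components (consistent with C ⊗ B)
H_a = np.concatenate([core[:, :, r] for r in range(R)], axis=1)

# Structural part of the Tucker3 model in (4)
Xa_model = A @ H_a @ np.kron(C, B).T
print(Xa_model.shape)  # (6, 12)
```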

Rocci and Vichi (2005) propose the T3clus model to simultaneously partition the objects into clusters and reduce variables and subjects by components. It is formalized as

$$\mathbf{X}_{\mathrm a}=\mathbf{UH}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a}.$$
(5)

By comparing (4) and (5), we can see that the main difference relies in the binary matrix U, which replaces the component matrix for the objects A. It follows that in T3clus, the elements of Ha express the interactions among the object clusters, the variable components, and the subject components. Therefore, T3clus is a T3 model with binary constraints on one matrix (A, renamed as U). It is a special case of the so-called NMFA/GENNCLUS model, mentioned by Carroll and Chaturvedi (1995).

As for Tucker3, the T3clus solution is not unique. Specifically, if the component matrices are transformed by arbitrary non-singular matrices, it is sufficient to counterrotate the core supermatrix to obtain an equally fitting solution. In detail, letting S and T be two non-singular matrices of appropriate order, we get the transformed matrices BR = BS and CR = CT and the counterrotated supermatrix HaR = Ha (T–1 ⊗ S–1)′ such that

$$\mathbf{UH}_{\mathrm aR}{(\mathbf{C}_{R}\otimes \mathbf{B}_{R})}^{^{\prime}}=\mathbf{UH}_{\mathrm a}{\left(\mathbf{T}^{-1}\otimes \mathbf{S}^{-1}\right)}^{^{\prime}}{(\mathbf{CT}\otimes \mathbf{BS})}^{^{\prime}}=\mathbf{UH}_{a}{\left(\mathbf{T}^{-1}\otimes \mathbf{S}^{-1}\right)}^{^{\prime}}{\left(\mathbf{T}\otimes \mathbf{S}\right)}^{^{\prime}}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}=\mathbf{UH}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}.$$
(6)

The use of T3 is sometimes prevented by difficulties in interpreting the solution, due to the interactions given by the core. A simpler alternative is represented by the Candecomp/Parafac (CP) model, independently proposed by Carroll and Chang (1970) and Harshman (1970). Although T3 and CP present several distinctive features, the latter model can be seen as a constrained version of the former one with the same number of components for objects, variables and subjects (P = Q = R = K) and a core, denoted by W, which has now order (K × K × K), with only K elements different from zero. In particular, the core is assumed to be superdiagonal, i.e., the nonzero elements are wkkk, k = 1, …, K. This implies that a component in a certain mode can only be related to a single component in another mode. We have

$$\mathbf{X}_{\mathrm a}=\mathbf{AW}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a}.$$
(7)

The component matrices A of order (N × K), B of order (J × K), and C of order (H × K) are not necessarily columnwise orthonormal. Their scaling is often fixed by setting diag(A′A) = diag(B′B) = diag(C′C) = IK, where IK is the identity matrix of order K and diag(∙) denotes the operator that returns the diagonal matrix with the diagonal elements of its argument. If the scaling is not fixed and A, B, and C are fully unconstrained, Wa is superfluous, and the CP model can be equivalently formulated as

$$\mathbf{X}_{\mathrm a}=\mathbf{AI}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a}=\mathbf{A}{(\mathbf{C}\left|\otimes \right| \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a},$$
(8)

where Ia has the same structure as Wa, but its nonzero elements are equal to 1, and the symbol | ⊗| denotes the Khatri-Rao product. The equivalent CP formulations in (7) and (8) may help observe that, differently from T3, the CP solution is unique up to scaling and permuting the components. This implies that the component interpretation is unique. Such a property is highly appreciated in several domains of research.
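The equivalence between (7) and (8) can be checked numerically, e.g., with the small sketch below (hypothetical dimensions; the Khatri-Rao product is implemented as the columnwise Kronecker product).

```python
import numpy as np

def khatri_rao(C, B):
    """Columnwise Kronecker (Khatri-Rao) product of two matrices
    with the same number of columns K."""
    K = C.shape[1]
    return np.column_stack([np.kron(C[:, k], B[:, k]) for k in range(K)])

N, J, H, K = 6, 4, 3, 2
rng = np.random.default_rng(2)
A = rng.normal(size=(N, K))
B = rng.normal(size=(J, K))
C = rng.normal(size=(H, K))

# Superdiagonal identity core unfolded: I_a of order (K x K^2),
# with a 1 linking component k of each mode and zeros elsewhere
I_a = np.zeros((K, K * K))
for k in range(K):
    I_a[k, k * K + k] = 1.0

lhs = A @ I_a @ np.kron(C, B).T      # formulation (7) with W_a = I_a
rhs = A @ khatri_rao(C, B).T         # formulation (8)
print(np.allclose(lhs, rhs))         # True
```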

2.3 The CPclus Model

We can now introduce our proposal. It can be seen as an extension of the CP model in order to simultaneously get a partition of the objects in K clusters and a dimension reduction of variables and subjects through K components. Bearing in mind model (7), we propose the so-called CPclus model that is formalized as

$$\mathbf{X}_{\mathrm a}=\mathbf{UW}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a},$$
(9)

where U of order (N × K) is the binary partition matrix for the objects, B and C are the component matrices for the variables and the subjects, respectively, such that diag(B′B) = diag(C′C) = IK, and Wa is the core supermatrix with non-zero elements denoted by wk (k = 1, …, K). In this way, we fix the scaling of the component matrices so that, differently from T3clus, the CPclus solution is unique (up to cluster/component labeling). As for CP, an equivalent formulation can be derived without fixing the scaling of B and C. In this case, by replacing Wa with Ia, we get

$$\mathbf{X}_{\mathrm a}=\mathbf{UI}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a}=\mathbf U{(\mathbf C\left|\otimes \right| \mathbf B)}^{^{\prime}}+\mathbf{E}_{\mathrm a}.$$
(10)

As we shall see, the expression in (10) will be used for estimation purposes. In general, we prefer to consider the CPclus formulation in (9) by assuming that the scaling of B and C is fixed. This allows us to observe that the elements of W represent the overall (across subjects) weights of the clusters. The generic k-th cluster, associated with the weight wk, can be interpreted with respect to the k-th columns of U, B, and C, denoted by uk, bk, and ck, respectively. In particular, elements of uk equal to one identify objects assigned to cluster k; elements of bk and ck provide the importance of the variables and subjects, respectively. In other words, objects assigned to the generic k-th cluster are fitted by the k-th variable and subject components. This highlights a one-to-one relation between cluster k and component k, k = 1, …, K, consistently with the CP model.

To investigate the characteristics underlying CPclus, it is convenient to express the model with respect to matrices Xh, h = 1, …, H. An equivalent formulation of (9) is given by

$$\mathbf{X}_{h} =\mathbf{UWC}_{h}\mathbf{B}^{^{\prime}}+\mathbf{E}_{h},$$
(11)

h = 1, …, H, where W is the diagonal matrix of size K having the nonzero elements of Wa in its main diagonal and Ch is the diagonal matrix of size K having the h-th row of C in its main diagonal, i.e., the component scores pertaining to subject h. Therefore, taking into account model (2), we can conclude that the CPclus centroids are defined by

$$\mathbf{Y}_{h} =\mathbf{WC}_{h}\mathbf{B}^{^{\prime}},$$
(12)

h = 1, …, H. This result can be explained as follows. Apart from W, which is common to all subjects and provides the weights of the object clusters, the centroids for subject h depend on Ch and B, capturing the three-way structure of the data. Matrix B measures the relevance of the variables for the K clusters/components. Since the same B is assumed across subjects, the underlying idea of CPclus is that the slices are described by the same matrices U and B but in different proportions, because B is weighted differently through the subject-specific matrices Ch (h = 1, …, H).
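The following small sketch (hypothetical dimensions and random parameters, written for this exposition) illustrates the slice-wise formulation (11): every slice Xh is fitted by the same partition U, weights W, and loading matrix B, rescaled through the subject-specific diagonal matrix Ch.

```python
import numpy as np

N, J, H, K = 6, 4, 3, 2
rng = np.random.default_rng(3)

U = np.zeros((N, K))
U[np.arange(N), rng.integers(0, K, size=N)] = 1   # binary partition matrix
B = rng.normal(size=(J, K))                       # variable components
C = rng.normal(size=(H, K))                       # subject components
w = rng.uniform(1, 2, size=K)                     # cluster weights (diagonal of W)

W = np.diag(w)
for h in range(H):
    C_h = np.diag(C[h, :])        # component scores of subject h
    Y_h = W @ C_h @ B.T           # centroid matrix for subject h, Eq. (12)
    Xh_model = U @ Y_h            # model part of slice X_h, Eq. (11)
print(Xh_model.shape)             # (6, 4)
```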

The CPclus solution can be found according to the least-squares approach by minimizing the loss function

$$f = {\Vert \mathbf{E}_{\mathrm a}\Vert }^{2} = {\sum }_{h=1}^{H}{\Vert \mathbf{E}_{h}\Vert }^{2},$$
(13)

with respect to U, B, and C, where || ⋅ || denotes the Frobenius norm. No constraints are introduced for B and C, while U is constrained to be a binary matrix with row-wise sums equal to one. The constrained minimization of (13) is carried out through an alternating least-squares algorithm described in Sect. 3.

2.4 Comparison with Related Models

CPclus takes inspiration from T3clus and the so-called Clustering around Latent Variables for three-Way data (CLV3W) model proposed by Wilderjans and Cariou (2016).

Similarly to T3clus, CPclus focuses on between-cluster differences in mean. As already shown, this is done by simultaneously clustering the objects and extracting components for variables and subjects. Differently from T3clus, where the low-dimensional configuration of variables and subjects is achieved through the Tucker3 model, CPclus considers the CP model. Hence, CPclus can be seen as a constrained version of T3clus, where K = Q = R and H is superdiagonal. This has a welcome side effect of simplicity. More specifically, the simplicity of the CPclus solution relies on the fact that the number of object clusters does not depend on the numbers of variable and subject components as it is for T3clus. In fact, Rocci and Vichi (2005, page 733) observe that the number of clusters K can be selected upon choosing the number of components Q (in the application, R is set equal to one). Therefore, it should be clear that the cluster interpretation in CPclus is easier than that in T3clus, where each cluster must be interpreted by considering all variable and subject components and their interactions provided by the elements of the core H. In CPclus, every cluster is associated with its specific component, and, therefore, it is sufficient to interpret the generic k-th cluster by considering the corresponding k-th component, k = 1, …, K.

With regard to CLV3W, we observe that both CPclus and CLV3W assume, for each cluster, one component resulting from the CP model. Nevertheless, the former is interested in between-cluster differences in mean, the latter in between-cluster differences in covariation. Starting from (10), the CLV3W model can be formalized as

$$\mathbf{X}_{\mathrm a}=\mathrm{diag}\left(\mathbf{a}\right) \mathbf U{\left( \mathbf C\left|\otimes \right| \mathbf B\right)}^{^{\prime}}+\mathbf{E}_{\mathrm a},$$
(14)

where diag(a) is the diagonal matrix with main diagonal equal to a. Vector a of length N contains the component scores for the objects, an, n = 1, …, N. Therefore, CLV3W is a particular CP model, with the component matrix for the objects constrained to be A = diag(a)U. In other words, matrix A is such that the generic n-th row contains K – 1 zeros and the element an located at column k, where k is the cluster to which object n belongs. For instance, if N = 6, K = 3, and

$$\mathbf{U}=\left[\begin{array}{ccc}1& 0& 0\\ 1& 0& 0\\ 1& 0& 0\\ 0& 0& 1\\ 0& 1& 0\\ 0& 1& 0\end{array}\right],$$
(15)

we have

$$\mathbf{A}=\left[\begin{array}{ccc}{a}_{1}& 0& 0\\ {a}_{2}& 0& 0\\ {a}_{3}& 0& 0\\ 0& 0& {a}_{4}\\ 0& {a}_{5}& 0\\ 0& {a}_{6}& 0\end{array}\right].$$
(16)

If the elements of a are constrained to be non-negative, we obtain the CLV3W model with Non-Negativity constraints (CLV3W-NN), introduced by Cariou and Wilderjans (2018).

It follows that CLV3W and CLV3W-NN can be seen as three-way extensions of the disjoint principal component analysis (see, e.g., Vichi & Saporta, 2009), where each object has at most one nonzero component score.
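For illustration, the sketch below (using the U of (15) together with a hypothetical score vector a) builds the constrained object component matrix A = diag(a)U of CLV3W; setting a equal to a vector of ones recovers the CPclus case.

```python
import numpy as np

# Partition matrix U from (15) and a hypothetical score vector a
U = np.array([[1, 0, 0],
              [1, 0, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 1, 0]], dtype=float)
a = np.array([0.55, -0.42, 0.51, 0.58, -0.48, 0.53])   # hypothetical scores

A_clv3w = np.diag(a) @ U   # constrained object component matrix of CLV3W, as in (16)
A_cpclus = U               # CPclus corresponds to diag(a) = I_N, so A reduces to U
```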

By comparing (10) and (14), we can observe that CPclus is a constrained version of both CLV3W and CLV3W-NN with diag(a) = IN. In CPclus, the component scores for the objects are either 1 or 0, i.e., within the same cluster the component scores are identical (equal to 1), while CLV3W and CLV3W-NN allow such component scores to differ. Such a difference has a relevant impact on the conclusions that can be drawn from the two models, i.e., on whether the research interest lies in between-cluster differences in mean or in covariation. CPclus resembles K-means in that they both use exactly the same model for all the objects assigned to the same cluster. In fact, the CPclus model part (C |⊗| B)′ plays the same role as the centroids do in K-means. Thus, CPclus indirectly characterizes the means of the clusters in terms of the rows of (C |⊗| B)′. This is not the case for CLV3W and CLV3W-NN due to the presence of the weight vector a, which actually models the objects assigned to the same cluster differently.

To fully understand the distinction between cluster differences in mean and in covariation, we consider three simulated datasets with N = 100, J = 2, and H = 2. Note that details on the data generation and on the model solutions we are going to discuss are given in the Supplementary Material.

In the first dataset, displayed in Fig. 1, two clusters are clearly visible. Cluster 1 is composed of objects denoted by empty symbols. They have small values of variable 1 and large values of variable 2 for subject 1, while for subject 2 the opposite holds. Conversely, filled symbols refer to objects belonging to cluster 2. Such objects are characterized by large values of variable 1 and small values of variable 2 for subject 1, while for subject 2 the opposite holds. The differences between the two clusters are more pronounced for subject 1 than for subject 2. Such differences are in mean and do not coincide with differences in covariation.

Fig. 1 Simulated data (N = 100, J = 2, and K = 2) from CLV3W. Empty and filled symbols denote CPclus, CLV3W-NN, MVNM, and PMVNM solutions. Squares and circles denote the CLV3W solution

By applying CPclus with K = 2, the clusters described above are fully identified. This is not the case when applying CLV3W with K = 2 (using the function CLV3W_kmeans, with the default options, of the R package ClustVarLV, Vigneau et al., 2022). Specifically, cluster 1 is formed by the (empty and filled) squares and cluster 2 by the (empty and filled) circles. Partitions from CLV3W and CPclus have a different meaning; in fact, the CLV3W clusters correctly distinguish the objects in terms of covariation. Conversely, by applying CLV3W-NN with K = 2 (specifying the option NN = TRUE in CLV3W_kmeans), we get the same partition as CPclus. Although some elements of a are negative, they are estimated as positive values.

To clarify, we note that the data are generated from the CLV3W model with K = 2 and 5% of noise added. Bearing in mind (14), matrices B and C have elements randomly generated from the standard normal distribution, and U contains ones in the first column for the first N/2 = 50 rows and in the second column for the last N/2 = 50 rows. As for vector a, the first N/2 elements refer to the objects in cluster 1: one-half are randomly generated from the uniform distribution in [0.5, 0.6] and the other half from the uniform distribution in [– 0.5, – 0.4]. The last N/2 elements of a, related to the objects in cluster 2, are generated in the same way.

CLV3W fits two different CP models with one component (one for each cluster) because it correctly recognizes that two different component structures exist. In other words, CLV3W searches for differences in covariation. Such differences do not necessarily lead to the same differences in mean revealed by CPclus. By requiring non-negativity constraints in CLV3W, i.e., by fitting CLV3W-NN to the data, the estimates of a are non-negative, implying that objects within each cluster are related with the corresponding component in terms of positive covariance.

In this first dataset, the component scores in a differ only within clusters but not between clusters in the sense that, for each cluster, the same random generation process is followed.

We now consider the data displayed in Fig. 2. They are generated by using the same matrices B, C, and U, and the same amount of noise is added, but now the first N/2 elements of a referring to the objects in cluster 1 are randomly generated from the uniform distribution in [– 0.5, – 0.4], and the last N/2 elements referring to the objects in cluster 2 are drawn from the uniform distribution in [0.5, 0.6]. Thus, the component scores in a differ between clusters but not within clusters.

Fig. 2 Simulated data (N = 100, J = 2, and K = 2) from CLV3W. Empty and filled symbols denote CPclus, CLV3W, MVNM, and PMVNM solutions. Squares and circles denote the CLV3W-NN solution

The two clusters are correctly identified by CPclus and CLV3W; therefore, the groups are characterized by between-cluster differences in both mean and covariation. Nevertheless, in this case, CLV3W-NN suffers from the presence of all-negative component scores in one cluster. In fact, the negative scores of a pertaining to the objects in cluster 1 are estimated as zeros and, in general, the cluster assignments appear to be essentially random.

We conclude the comparisons by considering the data displayed in Fig. 3. Once again, they are generated by using the same matrices B, C, and U with the same amount of noise added, but now a contains the same positive values, which, without loss of generality, can be all ones. Therefore, in this case, the CPclus model, or equivalently, the CLV3W or CLV3W-NN model with diag(a) = IN, is used to generate the data.

Fig. 3 Simulated data (N = 100, J = 2, and K = 2) from CPclus, or equivalently from CLV3W or CLV3W-NN with diag(a) = IN. Empty and filled symbols denote CPclus, CLV3W, CLV3W-NN, MVNM, and PMVNM solutions

The plot for subject 1 is more cluttered than the one for subject 2. Nevertheless, two clusters are visible, composed of empty and filled circles, respectively.

By applying CPclus, CLV3W, and CLV3W-NN with K = 2 clusters, we see that all models correctly recover the clusters because, in this case, the partition is formed by groups characterized by between-cluster differences in both mean and covariation and the component scores in a are non-negative. Note also that preliminary analyses we made (details not reported) show that CPclus, CLV3W, and CLV3W-NN produce the same partition when, under the above-described data generation process, the elements of a are positive, but are allowed to vary.

Summing up, we can state that CPclus, on the one side, and CLV3W and CLV3W-NN, on the other side, achieve different objectives, even though the former can be seen as a constrained version of the latter. Generally speaking, clustering is usually intended to account for differences in mean, and if this is the case, the CPclus model should be preferred. Conversely, if one is interested in discovering clusterwise component structures, i.e., differences in covariation, CLV3W should be adopted. The non-negativity constraints imply that CLV3W-NN might be positioned in between CPclus and CLV3W. Finally, it might occur that the partitions based on between-cluster differences in mean and in covariation coincide: in this particular case, CPclus, CLV3W, and CLV3W-NN can equivalently be considered and are expected to give the same results, although CPclus is more parsimonious than CLV3W and CLV3W-NN because the N parameters in vector a are not estimated.

We are going to further investigate and compare the performances of CPclus, CLV3W, and CLV3W-NN in the applications on synthetic and real data in Sects. 4 and 5.

For the sake of completeness, we analyze the datasets of Figs. 1, 2, and 3 by the model-based proposals by Viroli (2011) and Sarkar et al. (2020), implemented in the functions MatrixMixt and MatTrans.EM of the R packages MatrixMixtures (Tomarchio et al., 2021) and MatTransMix (Zhu et al., 2022), respectively. The model by Viroli (2011), henceforth denoted by MVNM (Matrix Variate Normal Mixture), assumes a mixture of matrix variate Normal distributions. In this framework, objects are supposed to come from a set of K components (or groups) of unknown proportions. For each object, a data matrix where rows and columns correspond to variables and subjects, respectively, is observed. Each component depends on the mean matrix (where rows and columns correspond to variables and subjects), the covariance matrix for the variables (containing the variances and covariances of the variables), and the covariance matrix for the subjects (containing the variances and covariances of the subjects). The aim of MVNM is to allocate the random sample of objects to the component from which they are assumed to come, under the assumption that each component corresponds to a group. This is done by computing the maximum a posteriori probabilities of group membership.

The same model is assumed by Sarkar et al. (2020) where parsimonious representations of the mean matrices and the covariance matrices are used to avoid overparameterization problems due to the high number of model parameters. The parsimonious structure of the mean matrices is carried out by considering an additive model in place of the general (unconstrained) one. The parsimonious representations of the covariance matrices are obtained from their spectral decompositions. See, for further details, Sarkar et al. (2020). Henceforth, we refer to this model as PMVNM (Parsimonious Matrix Variate Normal Mixture).

By considering a mixture of K = 2 matrix variate Normal distributions, we apply MVNM and PMVNM to the three artificial datasets by using the default options. For all datasets, according to the maximum a posteriori probability estimates, the two models discover the same partitions obtained by CPclus. Hence, we can state that the model-based proposals do not reveal between-cluster differences in covariation but differences in mean. Obviously, they are also able to uncover not only between-cluster differences in mean but also in (row and/or column) covariance. The applicability of the model-based approaches might be prevented whenever the numbers of variables and subjects are relatively large with respect to the number of objects. This is especially true for standard models such as MVNM, but also for the parsimonious versions such as PMVNM. In fact, apart from some simple PMVNM representations with fewer parameters, the more flexible and complex ones require a larger number of free parameters. Just to give an example, by considering the general model for the mean matrices and regardless of the possibly parsimonious representations of the covariance matrices, the number of objects must be larger than the number of free parameters of the group-specific mean matrices (N > KJH). In the additive model, it must be N > K(J + H – 1).

Finally, we observe that when H = 1, the three-way data array X reduces to a standard two-way matrix, and CPclus reduces to the standard K-means algorithm. This can be explained by noting that, when H = 1, C is no longer a matrix but a row-vector of ones of length K, and, therefore, C |⊗| B = B, which represents the centroid matrix.

2.5 CPclus Model Selection

When the CPclus model is fitted to data where the true number of clusters K is unknown, the number of clusters needs to be specified. As usual in clustering problems, in addition to interpretability reasons, one is advised to vary the number of clusters K* in an appropriate interval of values, inspect the values of the loss function to get some insight into the effect of such choices on the model fit, and further base one’s choice on the substantive interpretability of the solutions. In practice, looking at the scree plot of the loss values against increasing values of K*, the presence of an elbow may indicate the optimal choice of K*, beyond which the loss does not display a sizeable decrease.

Such a criterion can be formalized by computing the Scree-Ratio index (Wilderjans & Ceulemans, 2013):

$$\mathrm{SR}=\frac{f_{K^{*}}-f_{K^{*}-1}}{f_{K^{*}+1}-f_{K^{*}}},$$
(17)

where \({f}_{{K}^{*}}\) denotes the loss function value according to (13) computed with K* clusters. The optimal number of clusters is determined as the K* for which SR is maximal over an appropriate range of K* values.
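As a minimal sketch, the SR index in (17) can be computed from the loss values obtained over a range of K* values as follows (the loss values below are purely illustrative).

```python
import numpy as np

# Illustrative loss values f_{K*} for K* = 2, ..., 6 (hypothetical numbers)
K_values = np.array([2, 3, 4, 5, 6])
f = np.array([120.0, 70.0, 62.0, 58.0, 55.0])

# SR can only be computed for the interior values K* = 3, 4, 5
SR = (f[1:-1] - f[:-2]) / (f[2:] - f[1:-1])
best_K = K_values[1:-1][np.argmax(SR)]
print(dict(zip(K_values[1:-1], np.round(SR, 2))), best_K)
```

With these illustrative values, the largest ratio occurs at K* = 3, which would therefore be retained as the optimal number of clusters.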

As we shall see, the performance of SR has been assessed in the simulation study of Sect. 4, where it has achieved a good performance in recovering the underlying (known) clustering structure. Hence, the SR index has also been used in the applications.

In any case, as always in model selection, the final decision regarding the model to be retained should also be based on the interpretability of the solution.

3 The Algorithm

In order to shed light on the model formulation and to ease its estimation, different formulations are considered here. In (10) the CPclus model is formulated in terms of the supermatrix Xa with the object-by-variable matrices pertaining to the various subjects next to each other. Rows of Xa refer to objects, and its columns refer to all combinations of variables and subjects, with variables nested within subjects. Since in three-way analysis the role of objects, variables, and subjects can be switched, at least two other supermatrices, denoted by Xb and Xc, can be defined. Specifically, Xb of order (J × NH) has rows referring to the variables and columns referring to the combinations of objects and subjects, with the latter nested within the former. Finally, Xc has the subjects in the rows and the combinations of variables and objects in the columns (with the latter nested within the former). Thus, Xc has order (H × JN). The equivalent formulations of CPclus for Xb and Xc are

$$\mathbf{X}_{\mathrm b}=\mathbf{BW}_{\mathrm b}{(\mathbf U\otimes \mathbf C)}^{^{\prime}}+{\mathbf E}_{\mathrm b}$$
(18)

and

$$\mathbf{X}_{\mathrm c}=\mathbf{CW}_{\mathrm c}{(\mathbf B\otimes \mathbf U)}^{^{\prime}}+\mathbf{E}_{\mathrm c},$$
(19)

respectively, where, with obvious notation, Wb and Wc denote the alternative matrix unfoldings of the three-way array W. For estimation purposes, the scaling of B and C need not be fixed (this can be done upon convergence of the algorithm we are going to derive). To this end, it is convenient to rewrite (18) and (19) without explicitly using Wb and Wc. Then we get

$$\mathbf{X}_{\mathrm b}=\mathbf{BI}_{\mathrm a}{(\mathbf U\otimes \mathbf C)}^{^{\prime}}+\mathbf{E}_{\mathrm b}=\mathbf B{(\mathbf U\left|\otimes \right| \mathbf C)}^{^{\prime}}+\mathbf{E}_{\mathrm b},$$
(20)

and

$$\mathbf{X}_{\mathrm c}=\mathbf{CI}_{\mathrm a}{(\mathbf B\otimes \mathbf U)}^{^{\prime}}+\mathbf{E}_{\mathrm c}=\mathbf C{(\mathbf B\left|\otimes \right| \mathbf U)}^{^{\prime}}+\mathbf{E}_{\mathrm c},$$
(21)

respectively.

The CPclus model is fitted and estimated in a least-squares framework by solving the following constrained minimization problem:

$$\begin{array}{ll}\mathrm{min} & {\Vert \mathbf{X}_{\mathrm a}-\mathbf U{(\mathbf C\left|\otimes \right| \mathbf B)}^{^{\prime}}\Vert }^{2},\\ \mathrm{s.t.} & {u}_{nk}\in \left\{0,1\right\}\quad (n=1,\dots ,N;\ k=1,\dots ,K),\\ & \sum_{k=1}^{K}{u}_{nk}=1\quad (n=1,\dots ,N),\end{array}$$
(22)

An alternating least squares (ALS) algorithm has been implemented, which consists of the following steps.

  • Step 0: Initialization.

  • Starting admissible solutions for parameters are chosen randomly or in a rational way.

  • Step 1: Updating component matrix for the variables B.

  • Given the current estimates of U and C, we can consider the CPclus model in formula (20). The minimization problem can then be rewritten as

    $$\mathrm{min}{\Vert \mathbf{X}_{\mathrm b}-\mathbf B{\left(\mathbf U\left|\otimes \right| \mathbf C\right)}^{^{\prime}}\Vert }^{2}={\Vert \mathbf{X}_{\mathrm b}-\mathbf{B}{\mathbf{F}_{\mathrm b}}^{^{\prime}}\Vert }^{2},$$
    (23)
  • with respect to B, where Fb = (U | ⊗| C). This is an unconstrained regression problem with solution

    $$\mathbf{B} = \mathbf{X}_{\mathrm b}\mathbf{F}_{\mathrm b}{(\mathbf{F}_{\mathrm b}\mathrm{^{\prime}}\mathbf{F}_{\mathrm b})}^{-1}.$$
    (24)
  • Step 2: Updating component matrix for the subjects C.

  • Given the current estimates of U and B, the update of C is similar to that of B, provided that formula (21) is considered in place of (20). We have

    $$\mathrm {min}{\Vert \mathbf{X}_{\mathrm c}-\mathbf C{\left(\mathbf B\left|\otimes \right| \mathbf U\right)}^{^{\prime}}\Vert }^{2}={\Vert \mathbf{X}_{\mathrm c}-\mathbf{C}{\mathbf{F}_{\mathrm c}}^{^{\prime}}\Vert }^{2},$$
    (25)
  • where Fc = (B |⊗| U). The solution of the unconstrained minimization of (25) with respect to C is given by

    $$\mathbf{C} = \mathbf{X}_{\mathrm c}\mathbf{F}_{\mathrm c} {(\mathbf{F}_{\mathrm c}\mathrm{^{\prime}}\mathbf{F}_{\mathrm c})}^{-1}.$$
    (26)
  • Step 3: Updating membership matrix for the objects U.

  • Starting from (10) and given the current estimates of B and C, updating U amounts to performing the allocation step of the standard K-means algorithm (MacQueen, 1967) on matrix Xa with the fixed centroid matrix Fa′ = (C |⊗| B)′: each row of Xa is assigned to the closest centroid (row of Fa′), which provides the optimal membership matrix U.

  • Step 4: Stopping rule.

  • The loss value f(U, B, C) is computed for the current estimates. When the updated estimates have decreased the function value considerably (i.e., by more than an arbitrarily small convergence tolerance), all parameters are updated once more according to Steps 1 through 3. Otherwise, the process is considered to have converged.

At each step, the algorithm does not increase and generally decreases the loss function; since f is bounded from below, the algorithm converges to a point that can be expected to be at least a local minimum. To increase the chance of finding the global minimum, the algorithm should be run several times with different starting estimates, retaining the best solution in terms of the minimum loss function value.

Given the unconstrained estimates of B and C, it remains to constrain their columns to unit sums of squares and to compute the weights in Wa. This can be easily done by replacing B and C with B diag(B′B)–1/2 and C diag(C′C)–1/2, respectively, and computing the elements of Wa different from 0 as the diagonal elements of diag(B′B)1/2 diag(C′C)1/2. As already observed, such calculations do not alter the fit of the solution.
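The sketch below is a compact NumPy implementation of Steps 0–4, written for this exposition (it is not the authors' MATLAB routine); the function and parameter names (cpclus, unfold, n_init, and so on) are ours, and, for brevity, empty clusters are not explicitly handled. It alternates the regression updates (24) and (26) with the K-means allocation step, stops when the loss (13) no longer decreases appreciably, and finally rescales B and C as described above.

```python
import numpy as np

def khatri_rao(P, Q):
    """Columnwise Kronecker (Khatri-Rao) product of matrices with K columns each."""
    return np.column_stack([np.kron(P[:, k], Q[:, k]) for k in range(P.shape[1])])

def unfold(X):
    """Supermatrices X_a, X_b, X_c of an (N x J x H) array X, with the
    column orderings implied by (10), (20), and (21)."""
    N, J, H = X.shape
    Xa = np.concatenate([X[:, :, h] for h in range(H)], axis=1)     # N x JH
    Xb = np.concatenate([X[n, :, :] for n in range(N)], axis=1)     # J x NH
    Xc = np.concatenate([X[:, j, :].T for j in range(J)], axis=1)   # H x JN
    return Xa, Xb, Xc

def cpclus(X, K, n_init=10, max_iter=200, tol=1e-8, seed=0):
    """ALS estimation of the CPclus model (illustrative sketch)."""
    N, J, H = X.shape
    Xa, Xb, Xc = unfold(X)
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        # Step 0: random starting partition (every cluster non-empty) and random C
        labels = rng.integers(0, K, size=N)
        labels[rng.permutation(N)[:K]] = np.arange(K)
        U = np.eye(K)[labels]
        C = rng.normal(size=(H, K))
        f_old = np.inf
        for _ in range(max_iter):
            # Step 1: update B by regressing X_b on F_b = U |x| C, Eq. (24)
            Fb = khatri_rao(U, C)
            B = np.linalg.lstsq(Fb, Xb.T, rcond=None)[0].T
            # Step 2: update C by regressing X_c on F_c = B |x| U, Eq. (26)
            Fc = khatri_rao(B, U)
            C = np.linalg.lstsq(Fc, Xc.T, rcond=None)[0].T
            # Step 3: update U by the K-means allocation step with fixed
            # centroids given by the rows of F_a' = (C |x| B)'
            centroids = khatri_rao(C, B).T                          # K x JH
            d = ((Xa[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            U = np.eye(K)[np.argmin(d, axis=1)]
            # Step 4: stopping rule on the loss function (13)
            f = np.linalg.norm(Xa - U @ centroids) ** 2
            if f_old - f < tol:
                break
            f_old = f
        if best is None or f < best[0]:
            best = (f, U, B, C)
    f, U, B, C = best
    # Rescale B and C to unit column sums of squares; weights w_k = ||b_k|| * ||c_k||
    sb, sc = np.sqrt((B ** 2).sum(axis=0)), np.sqrt((C ** 2).sum(axis=0))
    return U, B / sb, C / sc, sb * sc, f
```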

A MATLAB routine, available upon request, implements the CPclus algorithm.

4 Simulation Study

In order to evaluate the performance of the CPclus model and the algorithm proposed in Sect. 3, an extensive simulation study has been carried out on artificially generated data. A comparison with CLV3W and CLV3W-NN, its closest least-squares-based competitors, has also been done.

4.1 Simulation Design and Measures of Performance

A number of datasets have been generated by the true underlying model (10) by setting N = 40, 100 (objects), J = 8, 15 (variables), H = 8, 15, 30 (subjects), and K = 3, 4, 5 (groups). Specifically, the numbers of variables and subjects (J × H) were combined to form five cells: (J = 8, H = 8), (J = 8, H = 30), (J = 15, H = 8), (J = 15, H = 15), and (J = 15, H = 30).

Clusters (and the corresponding membership matrices U) have been randomly generated with approximately equal sizes, checking for non-emptiness. In addition, two different structures (BCS) have been set for the component matrices B and C:

BCS1: matrices B and C have been randomly generated from the uniform distribution in [–1, 1];

BCS2: the first columns b1 and c1 of matrices B and C, respectively, have been randomly generated from the uniform distribution in [– 1, 1]; the remaining columns bk and ck, k = 2, …, K, have been built as

$$\mathbf{b}_{k} = \mathbf{b}_{1} + d\,\mathbf{z}_{\mathbf{B}k},$$
(27)
$$\mathbf{c}_{k} = \mathbf{c}_{1} + d\,\mathbf{z}_{\mathbf{C}k},$$
(28)

where zBk and zCk are vectors of length J and H, respectively, with elements randomly drawn from the standard normal distribution, and d is a scalar tuning the level of similarity between the columns of B and C. After preliminary analyses, in the simulation study, we have set d = 0.4. This choice has led to centroids close to each other, making structure BCS2 more complex than BCS1, where instead centroids are more separated.

Finally, each data matrix Xa has been built as

$$\mathbf{X}_{\mathrm a}=\mathbf U{(\mathbf C\left|\otimes \right| \mathbf B)}^{^{\prime}}+{\delta \mathbf{E}}_{\mathrm a},$$
(29)

where matrix Ea is a random noise matrix and δ allows the level of noise to be controlled. Each matrix Ea has been drawn from the standard normal distribution and rescaled so that it has the same sum of squares as the error-free data. The values of δ have been chosen equal to 0.75, 1, 1.5, and 2, which correspond to adding (100⋅δ)% of noise to the generated data and to setting cases with low, medium, high, and very high error, respectively.
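A minimal sketch of the data-generation scheme in (29) for one cell of the design (BCS1 setting; the random seed is arbitrary and the non-emptiness check on U is omitted) is given below.

```python
import numpy as np

def khatri_rao(C, B):
    return np.column_stack([np.kron(C[:, k], B[:, k]) for k in range(C.shape[1])])

# One cell of the design; delta tunes the noise level
N, J, H, K, delta = 40, 8, 8, 3, 0.75
rng = np.random.default_rng(4)

B = rng.uniform(-1, 1, size=(J, K))             # BCS1: uniform on [-1, 1]
C = rng.uniform(-1, 1, size=(H, K))
U = np.eye(K)[rng.integers(0, K, size=N)]       # random clusters (non-emptiness check omitted)

model = U @ khatri_rao(C, B).T                  # error-free data
E = rng.normal(size=model.shape)
E *= np.linalg.norm(model) / np.linalg.norm(E)  # same sum of squares as the error-free data
X_a = model + delta * E                         # Eq. (29)
```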

For each combination of the levels of the six controlled design factors (N, J, H, K, BCS, and δ), i.e., cell of the design, 100 data sets Xa were generated. Thus, by crossing the design factors, 2 (numbers of objects) × 5 (pairs of numbers of variables and subjects) × 3 (numbers of clusters) × 2 (B and C structures) × 4 (noise levels) × 100 (replications) = 24,000 different data arrays were generated.

In order to evaluate the performance of the algorithm in recovering the true clustering structure, the CPclus model has been fitted to each data array by varying the number of (nonempty) clusters K* = 2, 3, 4, 5, 6, regardless of the true number K. CLV3W and CLV3W-NN, instead, were run only with the true number of clusters (K* = K).

To prevent falling into local optima, for each dataset, the best solution in terms of loss has been retained from 100 different starts, either random or rational. Specifically, for 99 runs, the starting membership matrices U were randomly generated, while in one run the rational starting U was derived from the component matrix A of the unconstrained CP solution by setting the maximum value per row to one and all the rest to zero. Thus, the algorithm was run 2,400,000 times in total.

The simulation study aimed to analyze the performance of CPclus from different standpoints: goodness of recovery, also in comparison with CLV3W and CLV3W-NN, sensitivity to local minima, and scalability. In addition, in view of the applications, also the performance of the SR index in selecting the correct model has been investigated.

Goodness of Recovery

The recovery performance is analyzed to investigate to what extent the algorithm is able to correctly discover both the partition of objects and the component structures for variables and subjects.

The Adjusted Rand Index (Hubert & Arabie, 1985) between true and fitted partitions when K* = K is computed, which takes its maximum value equal to 1 when the two partitions are coincident; in addition, the percentage of successes in recovering exactly the true partitions is also reported.

As for the recovery of the component matrices B and C, Tucker’s congruence coefficient (Tucker, 1951) is used between true and fitted matrices after considering all cluster permutations and selecting the one that attains the largest value of the coefficient. The coefficient takes values in [0,1], where 1 indicates a perfect recovery.
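A minimal sketch (the function name is ours) of how Tucker's congruence coefficient between a true and a fitted component matrix can be computed over all column permutations is given below; here the columnwise congruences are averaged and the absolute value is taken so that the coefficient lies in [0, 1], which is an assumption about the exact computation used in the study.

```python
import numpy as np
from itertools import permutations

def tucker_congruence(M_true, M_fit):
    """Mean columnwise congruence between two component matrices,
    maximized over all column (cluster) permutations."""
    K = M_true.shape[1]
    def phi(x, y):
        # congruence of two vectors; the absolute value keeps it in [0, 1]
        return abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    best = 0.0
    for perm in permutations(range(K)):
        val = np.mean([phi(M_true[:, k], M_fit[:, p]) for k, p in enumerate(perm)])
        best = max(best, val)
    return best
```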

Sensitivity to Local Minima

In order to analyze how the algorithm is sensitive to falling into local solutions, for each dataset, a guess of the (unknown) “global optimum” is derived as the best solution out of 100 optimal values from the 100 starts, even if in principle such a best solution may be a local optimum itself. Thus, the percentage of solutions where the loss value is less than 0.2% higher than the “global optimum” is computed.

Moreover, the capacity of the rational start to reach the “global optimum” is analyzed by the percentage of times the rational start leads to such an optimal solution.

Scalability

The computation time per run is reported to study the scalability of the algorithm at different experimental conditions. In order to study how the choice of the starting partitions may affect the computation time, both the times per run for the two different starts (random or rational) are reported. The simulation study was run on a personal computer with Intel(R) Core i7-7700 CPU 3.60 GHz processor and 16 GB of RAM.

Model Selection

In order to select the optimal number of clusters/components, CPclus was run with an increasing number of groups K* (i.e., K* = 2, …, 6) regardless of the true number K of clusters generated. The SR index is given in (17), and its percentage of successes in selecting the correct model is computed to assess the validity and utility in practical applications when the true number of clusters/components is unknown.

In summary, for each cell of the experimental design, the following measures have been computed by averaging over the 100 datasets generated:

  • ARI: Adjusted Rand Index between true and fitted partitions when K* = K;

  • %(ARI = 1): percentage of successes in recovering the true partitions, i.e., percentage of times where ARI = 1 when K* = K;

  • TCC(B) and TCC(C): Tucker’s congruence coefficients between true and fitted matrices B and C, respectively;

  • %GOpt: percentage of solutions with loss value less than 0.2% higher than the “global optimum”;

  • RndTime: mean computation time per run (in seconds) from random starts;

  • RatTime: computation time per run (in seconds) from the rational start;

  • %Rtn: percentage of times the rational start leads to the “global optimum”;

  • %SR: percentage of times the Scree-Ratio index indicates the correct number of clusters;

  • SR3, SR4, SR5: Scree-Ratio indices for K* = 3, 4, 5 clusters, respectively.

4.2 Simulation Results (CPclus)

The results of the simulation study are reported and displayed separately for the two different structures (BCS) of the component matrices B and C in Tables 1, 2, 3, and 4. For the sake of brevity, the average TCC(B) and TCC(C) are not reported because they are always very close to 1 with a very limited variation (Table 5), so the degree of congruence between true and fitted component matrices is always optimal (Lorenzo-Seva & ten Berge, 2006).

Table 1 Average values of ARI, %(ARI = 1), SR3, SR4, SR5 and %SR distinguished by N, J, H, K and level of error (BCS1 setting, CPclus).
Table 2 Average values of ARI, %(ARI = 1), SR3, SR4, SR5 and %SR distinguished by N, J, H, K and level of error (BCS2 setting, CPclus).
Table 3 Average values of RndTime, RatTime, %GOpt and %Rtn distinguished by N, J, H, K and level of error (BCS1 setting, CPclus).
Table 4 Average values of RndTime, RatTime, %GOpt and %Rtn distinguished by N, J, H, K and level of error (BCS2 setting, CPclus).
Table 5 Descriptive statistics of the Tucker’s congruence coefficients for the recovery of B and C distinguished by BCS (CPclus)

Firstly, in the BCS1 setting (Table 1), the results for the cases with low and medium levels of noise exhibit a perfect recovery of the true partition in terms of ARI and %(ARI = 1) for all cells of the experimental design.

Generally speaking, by analyzing Tables 1 and 2, one can see that the cluster recovery, although high, decreases when (a) the amount of noise increases, as expected, and (b) the true number of clusters increases, with a slightly better performance in the case of the larger sample size. In the BCS1 setting, where a better separation between the centroids is ensured than in BCS2, the average ARI and the percentage of times CPclus succeeds in perfectly identifying the correct partition remain very high even for high and very high levels of noise, with a slight decline only for the case with small information (J = 8, H = 8).

The situation deteriorates in the BCS2 setting, where the lesser separation between centroids actually increases the level of error in the data: in Table 2, it can be observed that the average ARI is lower than in the corresponding cells of Table 1 and the %(ARI = 1) drops dramatically in the case of high levels of noise. However, in the BCS2 case, it can be observed that (a) given the number of variables J, the performance improves as the number of subjects H increases, and (b) given H, the performance improves as J increases, but to a lesser extent.

The percentage of hitting the estimated “global optimum” (%GOpt) is generally high in the BCS1 setting (Table 3), which denotes the capability of CPclus of finding the global optimum, and generally decreases as the number of clusters and the level of noise increase. In the BCS2 setting, where the centroids are less distinct, the same trend can be observed, but percentages %GOpt are generally lower than in BCS1, with values that fall down for high levels of noise. In both settings, the search for the global optimum best benefits from a larger sample, while for small samples, %GOpt slightly increases when J and H are higher.

As for the use of the rational start, %Rtn is either almost always at its maximum or maintains very high values when the amount of noise is not large (Table 4). The chance of finding the optimal solution when starting from the rational partition drops dramatically as the level of noise increases, especially for small samples and when the data information in terms of numbers of variables J and subjects H is small, which suggests the use of a high number of random starts in such conditions.

All in all, the use of the rational start seems to provide substantial improvements in terms of greater chances of reaching the optimal solution. This finding is supported by the fact that %Rtn is generally higher than %GOpt across all experimental conditions, so we may conclude that the use of the rational start reduces the risk of hitting local minima. Given the cell, the computation time when using the rational start is usually longer than with a random start. This can be explained by taking into account that the reported computation time also includes the time required to obtain the unconstrained CP solution upon which the rational start is based.

Further details on computation time emerge by inspecting Tables 3 and 4. We can see that the mean computation time per run increases with the complexity of the data (i.e., number of objects, variables, subjects, and clusters). It can be observed that the divergence between small and large sample sizes narrows as the data are more complex (and probably more informative) in terms of number of variables, subjects, and clusters for all noise levels, with some exceptions when data are very highly perturbed.

Moreover, the increase in computation time for higher numbers of clusters is more pronounced when the data contain large amounts of noise, as expected, while the use of the rational start does not seem to provide substantial differences in computation time.

As for model selection, Tables 1 and 2 display a generally excellent performance of the SR index in all conditions, particularly under the BCS1 setting (Table 1), where the percentages of successes in discovering the correct number of clusters are practically perfect. In the more “confused” BCS2 setting (Table 2), the recovery generally remains good in several conditions, but it deteriorates when the amount of noise is very high, especially when the sample size is smaller. Moreover, the recovery seems to be affected by the number of clusters.

Interestingly, in both settings, the average SR indices, as the current number of clusters varies (SR3, SR4, and SR5), present the maximum average values at the correct number of clusters K, but the magnitudes of such maxima are more pronounced when the level of noise is small and decrease drastically as the error increases. In practical applications, this suggests that a clear difference between the magnitude of the maximum value of the SR index as the number of clusters varies indicates a situation with clear partition and little noise.

All in all, even if the combination of added noise and reduced separation of the centroids deteriorates the clustering structure, anyway the SR index still performs well in selecting the correct model.

4.3 Simulation Results (CLV3W and CLV3W-NN)

CLV3W and its constrained version have been fitted to the same datasets generated for analyzing CPclus, and their performances in terms of ARI and %(ARI = 1) are reported in Tables 6 and 7, respectively. Firstly, we may observe that the two models give essentially the same performance, with no significant differences in either setting or in any experimental condition, so the comparison with CPclus is analogous for both. The recovery trends are similar to those described above for CPclus. Specifically, the cluster recovery gets worse when the amount of noise increases, when the true number of clusters increases, and when switching from the BCS1 to the BCS2 setting. The average values of ARI and %(ARI = 1) are rather high for most of the cells of the experimental design: the main exceptions refer to the BCS2 setting for high and very high levels of noise and the case with small information (J = 8, H = 8).

Table 6 Average values of ARI and %(ARI = 1), distinguished by N, J, H, K, level of error and BCS setting (CLV3W)
Table 7 Average values of ARI and %(ARI = 1), distinguished by N, J, H, K, level of error and BCS setting (CLV3W-NN)

By comparing the performances of CLV3W/CLV3W-NN and CPclus, it can be observed that the latter model outperforms the former ones. In fact, for every cell, the average values of ARI and %(ARI = 1) for CPclus are never lower than the corresponding ones for CLV3W and CLV3W-NN. In some cases, the average values coincide or are almost the same, and this often occurs in conditions where (essentially) perfect recoveries are obtained. In some other cases, the differences in the average values are much more pronounced, and this holds when the level of noise is high or very high and, above all, when J and H decrease.

All in all, different partitions are discovered by fitting CLV3W, CLV3W-NN, and CPclus, consistently with the different aims of the models. Focusing on between-cluster differences in covariation may occasionally lead to the same partition as focusing on between-cluster differences in mean, but in general the partitions differ.

For the sake of completeness, we also report the average values of TCC(B) and TCC(C), which are (0.9060; 0.9541) for CLV3W and (0.9052; 0.9533) for CLV3W-NN, respectively. They denote very high degrees of congruence between true and fitted component matrices, even if lower than those for CPclus.
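As a reference for how such congruence values can be obtained, the sketch below computes a columnwise Tucker congruence coefficient between a true and a fitted component matrix and averages it over columns; for simplicity it assumes the columns have already been matched, whereas a full evaluation would first resolve permutation and sign indeterminacies.

```python
# Sketch of Tucker's congruence coefficient between true and fitted component
# matrices (e.g., TCC(B) or TCC(C)), with columns assumed to be matched.
import numpy as np

def tucker_congruence(true_F, est_F):
    """true_F, est_F: component matrices of the same size with matched columns."""
    num = np.sum(true_F * est_F, axis=0)
    den = np.sqrt(np.sum(true_F**2, axis=0) * np.sum(est_F**2, axis=0))
    return np.mean(np.abs(num / den))     # mean absolute congruence over columns
```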

5 Applications

In this section, we report applications of the proposed CPclus model and of some of its least-squares and model-based competitors, namely CLV3W, CLV3W-NN, T3clus, MVNM, and PMVNM, to two well-known real data sets.

5.1 TV Data

CPclus is applied to the so-called TV data (Lundy et al., 1989), consisting of ratings of N = 15 American TV programs (see Table 8) on J = 16 bipolar scales, given by 40 subjects (university students) in 1981. Because of missing ratings, some students have been removed, and the analysis pertains to H = 30 students. Before applying CPclus, the variables have been centered across the TV programs, in line with Rocci and Vichi (2005), who fit T3clus to the same data. Therefore, this application offers a comparative assessment of the CPclus and T3clus results. CLV3W and CLV3W-NN are also applied to the TV data, whereas neither MVNM nor PMVNM can be used because N < min(J, H).
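For completeness, a minimal sketch of this preprocessing step is given below, assuming the data are stored in an objects × variables × subjects array and that the centering is performed across the TV programs separately for every variable–subject combination; the function name is illustrative.

```python
# Sketch of the preprocessing applied to the TV data: centering across the
# objects (TV programs) for every variable-subject combination.
import numpy as np

def center_across_objects(X):
    """X: (N x J x H) array; returns X with zero mean over the object mode."""
    return X - X.mean(axis=0, keepdims=True)
```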

Table 8 TV data. Partition matrix U for the TV programs (CPclus with K = 5)

Note that Rocci and Vichi (2005) impute the missing data by using the T3 model before applying T3clus. However, such an imputation does not seem to substantially alter the results. Rocci and Vichi (2005) consider as optimal the solution with K = 5 clusters for the objects, Q = 2 components for the variables, and R = 1 component for the subjects. The percentage of explained sum of squares (based on the dataset with 40 students) is 36.83%. The two dimensions for the variables distinguish the TV programs with respect to educational and informational aspects (Component 1) and entertaining aspects (Component 2). By repeating the analysis on the nonmissing data only, we observe that the component scores for the variables differ slightly, but the component interpretation remains the same. The analysis of the nonmissing data leads to one difference in the object partition, with All in the Family assigned to the same cluster as Mash and The Tonight Show.

We analyze the preprocessed TV data by means of CPclus, setting K = 5 for comparative purposes. The fit is equal to 45.62%. This value is not fully comparable with the T3clus one because we consider 30 students; moreover, CPclus here has more parameters than T3clus because it has K = 5 components for each of the three modes. Nevertheless, given the same number of clusters, we can roughly observe a certain increase in fit. We obtain the matrices U and B reported in Tables 8 and 9, respectively, and the following cluster weights: 72.30, 38.46, 39.16, 56.16, and 55.38. Matrix C is not reported because no information on the subjects is available and, therefore, it does not help in interpreting the solution. We only note that all elements but one are positive: the only exception is the score c22, equal to –0.09.
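The fit values reported here and in the remainder of the section are percentages of explained sum of squares; assuming the usual least-squares definition, they can be written as

$$\text{fit}\,(\%) \;=\; 100\left(1-\frac{\lVert \mathbf{X}-\widehat{\mathbf{X}}\rVert^{2}}{\lVert \mathbf{X}\rVert^{2}}\right),$$

where X denotes the preprocessed three-way array, X̂ its model reconstruction, and the norm is obtained by summing all squared entries of the array.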

Table 9 TV data. Component matrix B for the bipolar scales (scores higher than 0.25 in absolute value are in bold, CPclus with K = 5)

Cluster 1, characterized by the highest weight, comprises Charlie’s Angels and Let’s Make a Deal. Bearing in mind the non-negativity of the first column of C, these TV programs are perceived, in decreasing order of importance, as uninformative, intellectually dull, idiotic, uninteresting, fantasy, and shallow. Hence, we can observe an entertainment dimension characterized by relaxing TV programs. The Waltons and Little House on the Prairie belong to cluster 2. By inspecting the second column of B, we can conclude that the students (except student no. 2) point out the soporific features of these TV programs. Kojak, Football, and Saturday Night Live are assigned to cluster 3. Several scales describe these TV programs, which are mainly recognized for their roughness. The latter two clusters are associated with the lowest weights. Cluster 4 identifies TV programs regarding culture and information (60 Minutes, News, Jacques Cousteau, and Wild Kingdom) that are considered, among others, informative and intellectually stimulating. Finally, cluster 5 contains Mash, All in the Family, The Tonight Show, and Mork and Mindy. They are considered satirical and funny, thus highlighting an entertainment dimension different from that of cluster 1.

By comparing the CPclus and T3clus results, we observe that ARI = 0.67. In particular, two clusters (clusters 1 and 4) resulting from the two models coincide, while cluster 5 is the union of two T3clus clusters. Finally, the remaining two clusters split one T3clus cluster into two equally sized parts.

The solution with K = 5 is probably not the optimal one in terms of fit and parsimony, since a lower value of K may provide a more useful solution that still fits the data reasonably well. In order to choose K, we plot the goodness-of-fit values (expressed as percentages) for different numbers of clusters K* (Fig. 4) and compute the SR indices (Table 10).

Fig. 4 TV data. Goodness of fit values of CPclus (expressed as a percentage) for different numbers of clusters K*

Table 10 TV data. SR indices for CPclus solutions with different numbers of clusters K*

By inspecting Fig. 4 and Table 10, we can see that the optimal number of clusters/components seems to be K = 4, which corresponds to the maximum SR index (1.91) and yields a fit (41.36%) only slightly lower than that of the solution with K = 5 (45.62%). We skip the solution with K = 6 (second highest SR index), which appears to be too complex, and we also examine the more parsimonious solution with K = 3 (third highest SR index, 1.33; fit = 33.24%). Thus, we investigate these two solutions in detail. When K = 3, the following clusters are found:

  • Cluster 1 (weight 55.38): 60 Minutes, News, Jacques Cousteau, and Wild Kingdom;

  • Cluster 2 (weight 39.88): Charlie’s Angels, Let’s Make a Deal, The Waltons, Kojak, Football, and Little House on the Prairie;

  • Cluster 3 (weight 36.60): Mash, All in the Family, The Tonight Show, Saturday Night Live, and Mork and Mindy.

Cluster 1 was already discovered in the five-cluster solution and refers to TV programs characterized by culture and information. Cluster 3 resembles cluster 5 of the five-cluster solution, with the addition of Saturday Night Live. Its interpretation can be based on matrix B, reported in Table 11.

Table 11 TV data. Component matrix B for the bipolar scales (scores higher than 0.25 in absolute value are in bold, CPclus with K = 3)

Consistently with our previous findings, cluster 3 can be described by an entertainment dimension related to the scales satirical, funny and, to a lesser extent, erotic. The erotic aspect represents a novelty and appears to be related to Saturday Night Live. Finally, cluster 2, which is the largest cluster, contains leisure TV programs recognized as, among others, uninteresting, intellectually dull, and idiotic.

The four-cluster solution is summarized below:

  • Cluster 1 (weight 56.16): The Waltons and Little House on the Prairie;

  • Cluster 2 (weight 55.38): 60 Minutes, News, Jacques Cousteau, and Wild Kingdom;

  • Cluster 3 (weight 39.16): Mash, All in the Family, The Tonight Show, and Mork and Mindy;

  • Cluster 4 (weight 46.41): Charlie’s Angels, Let’s Make a Deal, Saturday Night Live, Kojak, and Football.

We can see that it is closely related to the three-cluster solution. Table 12 reports the cross-tabulation of the K = 3 and K = 4 solutions.

Table 12 TV data. Cross tabulation of the CPclus solutions with K = 3 and K = 4

The main difference is that The Waltons and Little House on the Prairie form a separate cluster, and Saturday Night Live is assigned to a different cluster. It follows that three out of four clusters (clusters 1–3) are equal to those obtained by setting K = 5. Cluster 4 identifies TV programs suited to taking a break, as shown by the last column of matrix B (see Table 13). These programs are perceived, in decreasing order of importance, as uninformative, intellectually dull, crude, insensitive, idiotic, shallow, callous, fantasy, leave me cold, and uninteresting; in other words, they are everyday-life distractors.

Table 13 TV data. Component matrix B for the bipolar scales (scores higher than 0.25 in absolute value are in bold, CPclus with K = 4)

All in all, we think that the solution with K = 4 clusters/components should be chosen because it is easily interpretable and represents the best compromise between fit and parsimony. The five-cluster solution can be seen as a refinement of the preferred one, but the increasing complexity is not justified by the additional cluster/component. Conversely, the three-cluster solution is too simplistic because TV programs such as The Waltons and Little House on the Prairie are rather different from the others and should reasonably belong to a separate cluster.

For comparative purposes, we also apply CLV3W to the same preprocessed data by setting K = 3, 4, and 5. The fit values are 43.21%, 46.38%, and 47.82%, respectively. The CLV3W solutions are always rather different from the CPclus ones. The ARI values comparing the CLV3W and CPclus solutions with the same number of clusters are 0.19 (K = 3), 0.37 (K = 4), and 0.25 (K = 5).

When K = 3, the CLV3W partition is:

  • Cluster 1: Mash (component score = –0.20), All in the Family (–0.31), 60 Minutes (0.45), The Tonight Show (–0.24), News (0.57), and Mork and Mindy (–0.53);

  • Cluster 2: Charlie’s Angels (0.42), Let’s Make a Deal (0.25), Saturday Night Live (0.48), Kojak (0.24), Jacques Cousteau (–0.55), and Wild Kingdom (–0.41);

  • Cluster 3: The Waltons (0.64), Football (–0.47), and Little House on the Prairie (0.60).

The component scores for the TV programs differ noticeably within clusters, highlighting that focusing on between-cluster differences in covariation rather than in mean leads to different partitions. Taking into account that matrix C has all positive scores, we can interpret the clusters/components as follows (matrix B is not reported). Cluster 1 contains six TV programs, two of them with positive component scores and the remaining four with negative ones. Those with positive scores (the highest pertains to News) are perceived as not funny, non-satirical, real, informative, and intellectually stimulating, while the opposite holds for the TV programs with negative scores (Mork and Mindy has the largest score in absolute value). A duality between TV programs with positive and negative scores also emerges in connection with cluster 2. In this case, positive scores (e.g., Charlie’s Angels) denote TV programs considered satirical, violent, fantasy, and erotic. With the same reasoning, we can interpret cluster 3. For instance, Football is recognized, among others, as insensitive, callous, violent, and crude.

The solution with K = 4 clusters is closely related to the one just described. The main difference is the split of cluster 2 into two clusters. More specifically, Charlie’s Angels (score = 0.87) and Kojak (0.49) are the only two elements of a new cluster (cluster 4). The remaining clusters are essentially unchanged; the only exception is 60 Minutes, now assigned to cluster 2 (with a negative score) together with Let’s Make a Deal, Saturday Night Live, Jacques Cousteau, and Wild Kingdom. The interpretation of the first three clusters/components is very similar to the previous one. In addition, the new cluster concerns TV programs considered violent, erotic, fantasy, and uninformative.

Finally, the K = 5 solution might be seen as a refinement of the previous two solutions. With respect to the K = 4 solution, clusters 3 and 4 remain the same, while some differences in clusters 1 and 2 can be observed. First of all, cluster 5 contains The Tonight Show and Saturday Night Live. Apart from The Tonight Show, cluster 1 coincides with cluster 1 found for K = 3. Cluster 2 is composed of Let’s Make a Deal, Jacques Cousteau, and Wild Kingdom.

We also provide a summary of the CLV3W-NN solutions with K = 3, 4, and 5. First of all, the fit values are 38.07%, 44.30%, and 46.71%, respectively, hence in between the fit values of CPclus and CLV3W for the same choice of K. Setting K = 3, we get the following partition:

  • Cluster 1: 60 Minutes (component score = 0.48), News (0.53), Jacques Cousteau (0.56), Football (0.13), and Wild Kingdom (0.40);

  • Cluster 2: The Waltons (0.73) and Little House on the Prairie (0.68);

  • Cluster 3: Mash (0.19), Charlie’s Angels (0.32), All in the Family (0.29), The Tonight Show (0.25), Let’s Make a Deal (0.24), Saturday Night Live (0.61), Kojak (0.15), and Mork and Mindy (0.52).

The partition differs from both the CPclus and the CLV3W ones with K = 3 (ARI = 0.36 and ARI = 0.12, respectively). Taking into account that the elements of C are non-negative and by inspecting B (not reported), we can conclude that cluster 1 identifies TV programs related to reality, perceived as non-satirical, not funny, informative, not erotic, real, and intellectually stimulating. Consistently with previous results, The Waltons and Little House on the Prairie are assigned to the same cluster (cluster 2). Cluster 3 is the largest, containing TV programs that, among others, are satirical, fantasy, and funny. By comparing the CLV3W and CLV3W-NN solutions, we can see that the former can assign to the same cluster TV programs that students rate in opposite ways, whereas this cannot happen with CLV3W-NN because of the non-negativity constraints.

The solutions with K = 4 and 5 are refinements of the K = 3 solution. Note that, as for K = 3, these solutions provide partitions more similar to the CPclus ones than to the CLV3W ones, given the same K. In fact, when K = 4, the ARI values with CPclus and CLV3W are 0.78 and 0.31, respectively, and when K = 5, we get 0.77 and 0.26, respectively. Setting K = 4, we observe that cluster 3 of the K = 3 solution is split into two clusters. One of them also includes Football, which has the lowest component score in cluster 1 of the K = 3 solution and no longer belongs to that cluster. When K = 5, the partition resembles the one with K = 4 clusters, except for two TV programs, News and Football, which are assigned to the new cluster 5.

5.2 Crab Data

The CPclus model is used to address the classification problem of the Crab data (see, e.g., Kroonenberg, 2008). N = 48 blue crabs lived in two areas of North Carolina, Albemarle Sound and the Pamlico River. All 16 crabs from Albemarle Sound were healthy, while the 32 crabs from the Pamlico River were divided into two equally sized groups (healthy crabs and diseased crabs). Thus, the crabs’ geographical area and health status allow us to distinguish three equally sized clusters. In the analysis, J = 3 tissue samples (gill, hepatopancreas, and muscle) were considered, and data on H = 25 trace elements were collected. The research interest lies in assessing whether the K = 3 known-in-advance clusters can be identified from the observed values of the trace elements for the three tissue samples.

For each combination of tissue sample and trace element, the data are standardized across the crabs before applying CPclus. The quality of the classification obtained by setting K = 3 is reported in Table 14.
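As with the TV data, a minimal sketch of this preprocessing step is given below, assuming an objects × tissue samples × trace elements layout; the only difference from the earlier centering sketch is the division by the per-combination standard deviation.

```python
# Sketch of the Crab-data preprocessing: z-scoring of every combination of
# tissue sample and trace element across the crabs (object mode).
import numpy as np

def standardize_across_objects(X):
    """X: (N x J x H) array; zero mean and unit variance over the object mode."""
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, ddof=1, keepdims=True)
    return (X - mean) / std
```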

Table 14 Crab data. Confusion matrix (CPclus with K = 3, cluster weights in parentheses)

Only three of the 48 crabs are not correctly assigned. Specifically, the geographical area is always identified correctly, but two diseased crabs from the Pamlico River are recognized as healthy. Therefore, we can conclude that the trace elements distinguish the crabs very well. To investigate the peculiarities of the clusters, we report the component matrices B and C in Tables 15 and 16, respectively.

Table 15 Crab data. Component matrix B for the tissue samples (CPclus with K = 3)
Table 16 Crab data. Component matrix C for the trace elements (scores higher than 0.25 in absolute value are in bold, CPclus with K = 3)

From Table 15, we can see that, for every column of B, the highest component score pertains to the gill, which implies that this tissue plays the most relevant role in classifying the crabs. The component scores for the trace elements highlight the key aspects for distinguishing the crabs. To clarify this point, from Table 16 we can see that the diseased crabs are characterized by large values of, among others, aluminum, cobalt, manganese, and iron, i.e., the highest scores in the first column of C. These diseased crabs from the Pamlico River are distinguished from the healthy ones from the same area because the latter (in cluster 3) have high values of nickel, phosphorus, and magnesium and low values of cadmium and vanadium (third column of C). The second column of C allows us to describe the distinctive features of the crabs from Albemarle Sound: high levels of copper and cadmium and low levels of, e.g., magnesium and cobalt.

Given the cluster/component interpretation, we investigate the misclassification of the three crabs. Limiting our attention to the gill tissue, we find that the three crabs involved present different characteristics from the other diseased crabs for the six trace elements that most influence the component. In detail, among the diseased crabs, the values in the left tails of the distributions of aluminum, cobalt, manganese, iron, etc. almost always pertain to the misclassified crabs. For such crabs, it is therefore not clear how the disease is related to the trace elements.

By applying CLV3W and CLV3W-NN with K = 3, we obtain different partitions summarized in the confusion matrices of Tables 17 and 18, respectively.

Table 17 Crab data. Confusion matrix (CLV3W with K = 3)
Table 18 Crab data. Confusion matrix (CLV3W-NN with K = 3)

Obviously, these cluster structures depend on rather different underlying components. To give an example from the CLV3W solution, the scores of the first component for the trace elements that are higher than 0.25 in absolute value are 0.42 (potassium), –0.38 (chromium), and 0.31 (silver). By inspecting Table 16, we can observe that these three trace elements play a negligible role in all three CPclus components, which instead capture between-cluster differences in mean.

Given the size of the array, MVNM and PMVNM cannot be applied. In fact, even when considering the additive model for the mean matrices, the number of objects is smaller than the number of free parameters, since N = 48 < K(J + H – 1) = 3(3 + 25 – 1) = 81. However, by considering Table 16, we empirically select a subset of three trace elements that seem to play a relevant role in classifying the crabs: arsenic, cadmium, and iron, all of which have large scores in absolute value on two components. We model this smaller dataset of order (48 × 3 × 3), for which K(J + H – 1) = 3(3 + 3 – 1) = 15 < 48, by using MVNM and PMVNM with K = 3. The confusion matrices are reported in Tables 19 and 20.

Table 19 Crab data. Confusion matrix (MVNM with K = 3)
Table 20 Crab data. Confusion matrix (PMVNM with K = 3)

We can observe a fairly good recovery performance for both MVNM and PMVNM, bearing in mind that it is based on only three of the 25 trace elements.

6 Concluding Remarks

We have presented the CPclus model for the simultaneous reduction of the objects, variables, and subjects of three-way data. It achieves this goal by seeking both clusters of objects and components for variables and subjects. The three-way structure of the data is properly taken into account by assuming a Candecomp/Parafac configuration for the variable and subject components. The object partition is found by following a K-means strategy, where the Candecomp/Parafac-based model part for variables and subjects plays the same role as the centroids do in K-means. As extensively discussed, unlike the closely related CLV3W and CLV3W-NN models, which focus on between-cluster differences in covariation, CPclus identifies a partition of the objects based on between-cluster differences in mean. This peculiarity is shared with the model-based proposals. However, unlike them, CPclus does not require the number of objects to be much larger than the numbers of variables and subjects and, in general, can be applied to three-way arrays of arbitrary size without the additional assumptions required by the model-based models.

An extensive simulation study has been carried out to assess the ability to recover clusters and components and to investigate computational issues. Moreover, applications to real-life data have been presented to show the potential and effectiveness of CPclus. Comparisons with some of its competitors based on the simulated and real data have also been made. The results are satisfactory and stimulate future research on the topic. The methodology might be extended following the fuzzy approach to clustering, by replacing the hard allocation of objects to clusters with soft membership degrees. Another interesting line of research might be the introduction of weights for variables and/or subjects to enhance clustering accuracy. Furthermore, a hierarchical approach might be adopted for partitioning the objects. Finally, it would be interesting to investigate how CPclus performs when the data are highly skewed or contain outliers, and possibly to propose suitable robust versions.