1 Introduction

In many domains of research, it frequently occurs that the available information pertains to a number of objects on which a set of variables is measured by different subjects (or on different occasions). This information can be stored in a three-way data array or tensor. The three sets of entities (objects, variables, and subjects) associated with such a three-way data array are usually referred to as modes. A three-way array can be seen as a collection of standard two-way, two-mode (objects by variables) matrices, one for each subject. It is obvious that three-way data offer richer information but, at the same time, are more complex than two-way data. It is therefore fundamental to develop new statistical tools able to exploit the potentialities of three-way data. Of course, classical statistical methods could be used to analyze three-way data. For instance, it would be sufficient to juxtapose the matrices pertaining to the various subjects next to each other to obtain a matrix with rows corresponding to objects and columns corresponding to all the combinations of variables and subjects. This matrix can be analyzed by classical methods, but the conclusions are usually inadequate because the three-way structure is not properly taken into account, i.e., such methods simply ignore that the variables are replicated across subjects.

A traditional goal of a three-way analysis involves dimension reduction in order to summarize the observed data through a limited number of components. For this purpose, two popular choices are represented by the Tucker3 (Tucker, 1966) and Candecomp/Parafac (Carroll & Chang, 1970; Harshman, 1970) models. The former assumes different numbers of components for objects, variables, and subjects, and their triple interactions are evaluated in a reduced three-way array called core. The latter is a simpler model, assuming a common number of components and a constrained structure of the core. This implies that, differently from Tucker3, the Candecomp/Parafac solution is essentially unique (Kruskal, 1977).

Although the two models above are widely applied for extracting components from three-way data, they do not allow clusters to be discovered. In this respect, Tucker3 and Candecomp/Parafac have been generalized to the clustering framework. We can roughly distinguish two main classes of techniques, aiming at partitioning the entities of a single mode or of more modes simultaneously. The case where two or three modes are clustered simultaneously is commonly referred to as co-clustering (see Madeira & Oliveira, 2004, for a comprehensive review of the two-way case, and Papalexakis et al., 2013, for a proposal in the three-way case). In this paper, we focus on clustering a single mode, seeking, without loss of generality, a partition of the objects, which is by far the most common case.

Starting from Krijnen (1993) and Krijnen and Kiers (1993), several clustering models for three-way data can be found in the literature following either a model-based (see, e.g., Viroli, 2011; Gallaugher & McNicholas, 2018, 2020; Melnykov & Zhu, 2018; Silva et al., 2019; Sarkar et al., 2020; Tomarchio et al., 2022) or a least-squares approach (see, e.g., Rocci & Vichi, 2005; Vichi et al., 2007; Wilderjans & Ceulemans, 2013; Wilderjans & Cariou, 2016; Cariou & Wilderjans, 2018; Cariou et al., 2021; Schoonees et al., 2022). Here, the least-squares approach will be followed throughout the paper.

Within such an approach, Wilderjans and Ceulemans (2013) clearly observe that two different strategies can be adopted. The first consists of models focusing on the between-cluster differences in covariation. In these models, the components extracted within each cluster are different from those extracted within the remaining clusters. Therefore, the assumption is that objects assigned to the same cluster share the same underlying component structure, whereas objects belonging to different clusters have different underlying components. The second strategy concerns models aiming at discovering between-cluster differences in mean, i.e., models where the differences in the mean profiles of the different clusters are explained by components. This can be done by summarizing only variables and subjects through components, whilst objects are partitioned into a reduced number of clusters. It follows that objects are treated asymmetrically with respect to variables and subjects. Specifically, objects are assigned to clusters following a K-means-type (MacQueen, 1967) procedure where the centroids lie in the low-dimensional space spanned by the components for the variables and the subjects. The partitioning of objects and the dimension reduction of both variables and subjects are performed simultaneously. We refer to Wilderjans and Ceulemans (2013) for further details on the between-cluster differences in covariation and in mean.

Examples of models focusing on the between-cluster differences in covariation include Wilderjans and Ceulemans (2013) and Wilderjans and Cariou (2016), among others. Wilderjans and Ceulemans (2013) propose a clusterwise version of the Candecomp/Parafac model, where objects are classified into a limited number of clusters and, simultaneously, the objects assigned to each cluster are modeled through the Candecomp/Parafac model. Wilderjans and Cariou (2016) develop a model, further extended by Cariou and Wilderjans (2018) and Cariou et al. (2021), that boils down to simultaneously grouping the objects into clusters and applying Candecomp/Parafac with one component for each cluster. In Cariou and Wilderjans (2018), non-negativity constraints are introduced in the estimation process.

Rocci and Vichi (2005) and Vichi et al. (2007) propose the most prominent models aiming at discovering between-cluster differences in mean. In these proposals, suitable generalizations of the Tucker3 model are developed by introducing a binary object partition matrix in place of the standard component matrix for the objects. Rocci and Vichi (2005) and Vichi et al. (2007) can also be seen as three-way generalizations of, respectively, the reduced K-means (De Soete & Carroll, 1994) and Factorial K-means (Vichi & Kiers, 2001) methods for standard two-way data, where clusters of objects and components for variables are simultaneously determined.

In this paper, we propose a new clustering model for three-way data where objects are assigned to clusters and, simultaneously, variables and subjects are reduced consistently with the latter strategy described above. For ease of interpretation and for the uniqueness property, differently from Rocci and Vichi (2005) and Vichi et al. (2007), which represent the starting point of our work, we consider the Candecomp/Parafac model with one component for each cluster. In this respect, the proposed model is also closely related to Wilderjans and Cariou (2016). However, it is important to stress that the objectives of the two models are different. Namely, our model analyzes between-cluster differences in mean because it adopts a K-means strategy where the Candecomp/Parafac-based model part for variables and subjects plays the same role as the centroids do in K-means, whilst Wilderjans and Cariou (2016) analyze between-cluster differences in covariation. As we shall see in detail through some clarifying examples, this implies that only our model is a partitioning method in a strict sense (differences in mean), because Wilderjans and Cariou (2016) identify groups of objects with the same underlying structure (differences in covariation), which does not necessarily lead to standard clusters distinct in mean.

The paper is organized as follows: In the next section, the novel clustering model is proposed and formalized in a general framework. Some issues related to special cases and model assessment are discussed, together with a theoretical and practical comparison with the existing techniques. The estimation algorithm is presented in Sect. 3. Section 4 contains the results of an extended simulation study carried out in order to evaluate the performance of the proposed model and algorithm, also in comparison with its potential and direct competitors. Some applications to real data are given in Sect. 5. Finally, the last section is devoted to some concluding remarks.

2 The Clustering Model

In this section, we introduce a novel clustering model for three-way data exploiting the potentialities of both the K-means clustering algorithm and the Candecomp/Parafac model. For this purpose, we start from a general K-means-based clustering model for three-way data and, later, we recall the Tucker3 and Candecomp/Parafac models together with T3clus, the model proposed by Rocci and Vichi (2005).

2.1 A General K-Means Based Clustering Model

Let us suppose J variables are measured on N objects by H subjects. Such data are stored in the three-way array X of order (N × J × H), whose generic element is xnjh, expressing the measurement of object n (n = 1, …, N) with respect to variable j (j = 1, …, J) made by subject h (h = 1, …, H). Array X can be seen as a collection of matrices, one for every subject. Therefore, matrix Xh (h = 1, …, H) of size (N × J) is usually referred to as slice and contains all measurements from subject h.

The most general clustering model can be fully specified as follows:

$$\mathbf{X}_{h} = \mathbf{U}_{h}\mathbf{Y}_{h} + \mathbf{E}_{h},$$
(1)

h = 1, …, H, where Eh is the error term for subject h and Uh is the membership matrix of order (N × K) assigning objects to clusters, with K denoting the number of clusters. Matrix Uh is binary with exactly one entry equal to 1 per row and identifies a partition of the N objects into K disjoint clusters for subject h (h = 1, …, H). Matrix Yh (h = 1, …, H) of order (K × J) is the so-called subject-specific centroid matrix. The generic k-th row of Yh refers to the centroid of cluster k expressed in terms of the J observed variables. Thus, the model assumes a different partition across the slices related to the subjects. In other words, separate partitions are independently sought by a K-means-type model for every subject.

In order to exploit the three-way structure of the data, i.e., to properly take into account that the same variables are observed on the same objects by the subjects, constrained versions of model (1) can be derived. For instance, we may assume that Uh = U, h = 1, …, H. Then, we get

$$\mathbf{X}_{h} =\mathbf{UY}_{h} +\mathbf{E}_{h},$$
(2)

h = 1, …, H. Matrix U is a membership matrix, subject to the same constraints as Uh. Therefore, model (2) identifies a common partition across subjects. As in model (1), different centroid matrices Yh are assumed, allowing for possible differences among subjects. Therefore, model (2) is a K-means-type model with a consensus partition specified by U.

Let us generally denote by Aa = [A1|A2| …|AH] the supermatrix where the matrices Ah (h = 1, …, H) are collected next to each other. When applied to array X, this yields the supermatrix Xa of order (N × JH), which contains all the object-by-variable matrices pertaining to the subjects next to each other. With obvious notation, model (2) can be equivalently rewritten in terms of Xa as

$$\mathbf{X}_{\mathrm a} =\mathbf{UY}_{\mathrm a} +\mathbf{E}_{\mathrm a}.$$
(3)

In addition, model (3) can be extended by synthesizing the centroids through a limited number of components for the variables and the subjects. In order to describe such extensions, the Tucker3 and Candecomp/Parafac models need to be recalled.
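To make the notation concrete, the following sketch (a minimal Python/NumPy illustration written for this exposition; the array X and its dimensions are hypothetical) shows how the supermatrix Xa of order (N × JH), with variables nested within subjects, can be assembled from the slices Xh.

```python
import numpy as np

# Hypothetical dimensions: N objects, J variables, H subjects
N, J, H = 6, 4, 3
rng = np.random.default_rng(0)

# Three-way array X stored as an (N x J x H) NumPy array
X = rng.normal(size=(N, J, H))

# Supermatrix X_a of order (N x JH): the H object-by-variable slices
# X[:, :, h] are placed next to each other (variables nested within subjects)
X_a = np.concatenate([X[:, :, h] for h in range(H)], axis=1)
print(X_a.shape)  # (6, 12)
```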

2.2 The Tucker3, T3clus, and Candecomp/Parafac Models

The Tucker3 (T3) model, proposed by Tucker (1966), can be expressed as

$$\mathbf{X}_{\mathrm a}=\mathbf{AH}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a},$$
(4)

where A of order (N × P), B of order (J × Q), and C of order (H × R) are the component matrices for the objects, the variables, and the subjects, respectively, with P (≤ N) being the number of object components, Q (≤ J) the number of variable components, and R (≤ H) the number of subject components. Without loss of fit, A, B, and C can be constrained to be columnwise orthonormal. Moreover, Ha is the supermatrix obtained from the so-called core array H of order (P × Q × R), expressing the interactions among the object, variable, and subject components. Different numbers of components in A, B, and C can be assumed to deal with different levels of complexity for objects, variables, and subjects. Finally, the symbol ⊗ denotes the Kronecker product.
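As an illustration of the structural part of (4), the following sketch (hypothetical dimensions and random matrices, written for this exposition) reconstructs the model part of Xa from A, the core supermatrix Ha, and the Kronecker product C ⊗ B.

```python
import numpy as np

N, J, H = 6, 4, 3       # objects, variables, subjects (hypothetical)
P, Q, R = 2, 3, 2       # numbers of components for the three modes

rng = np.random.default_rng(1)
A = rng.normal(size=(N, P))          # object component matrix
B = rng.normal(size=(J, Q))          # variable component matrix
C = rng.normal(size=(H, R))          # subject component matrix
core = rng.normal(size=(P, Q, R))    # core array H

# Core supermatrix H_a of order (P x QR), with variable components
# nested within subject components (consistent with C ⊗ B)
H_a = np.concatenate([core[:, :, r] for r in range(R)], axis=1)

# Structural part of the Tucker3 model in (4)
Xa_model = A @ H_a @ np.kron(C, B).T
print(Xa_model.shape)  # (6, 12)
```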

Rocci and Vichi (2005) propose the T3clus model to simultaneously partition the objects into clusters and reduce variables and subjects by components. It is formalized as

$$\mathbf{X}_{\mathrm a}=\mathbf{UH}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a}.$$
(5)

By comparing (4) and (5), we can see that the main difference relies in the binary matrix U, which replaces the component matrix for the objects A. It follows that in T3clus, the elements of Ha express the interactions among the object clusters, the variable components, and the subject components. Therefore, T3clus is a T3 model with binary constraints on one matrix (A, renamed as U). It is a special case of the so-called NMFA/GENNCLUS model, mentioned by Carroll and Chaturvedi (1995).

As for Tucker3, the T3clus solution is not unique. Specifically, if the component matrices are transformed by arbitrary non-singular matrices, it is sufficient to counterrotate the core supermatrix to obtain an equally fitting solution. In detail, letting S and T be two non-singular matrices of appropriate order, we get the transformed matrices BR = BS and CR = CT and the counterrotated supermatrix HaR = Ha (T–1 ⊗ S–1)′ such that

$$\mathbf{UH}_{\mathrm aR}{(\mathbf{C}_{R}\otimes \mathbf{B}_{R})}^{^{\prime}}=\mathbf{UH}_{\mathrm a}{\left(\mathbf{T}^{-1}\otimes \mathbf{S}^{-1}\right)}^{^{\prime}}{(\mathbf{CT}\otimes \mathbf{BS})}^{^{\prime}}=\mathbf{UH}_{a}{\left(\mathbf{T}^{-1}\otimes \mathbf{S}^{-1}\right)}^{^{\prime}}{\left(\mathbf{T}\otimes \mathbf{S}\right)}^{^{\prime}}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}=\mathbf{UH}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}.$$
(6)

The use of T3 is sometimes prevented by difficulties in interpreting the solution, due to the interactions given by the core. A simpler alternative is represented by the Candecomp/Parafac (CP) model, independently proposed by Carroll and Chang (1970) and Harshman (1970). Although T3 and CP present several distinctive features, the latter model can be seen as a constrained version of the former one with the same number of components for objects, variables and subjects (P = Q = R = K) and a core, denoted by W, which has now order (K × K × K), with only K elements different from zero. In particular, the core is assumed to be superdiagonal, i.e., the nonzero elements are wkkk, k = 1, …, K. This implies that a component in a certain mode can only be related to a single component in another mode. We have

$$\mathbf{X}_{\mathrm a}=\mathbf{AW}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a}.$$
(7)

The component matrices A of order (N × K), B of order (J × K), and C of order (H × K) are not necessarily columnwise orthonormal. Their scaling is often fixed by setting diag(A′A) = diag(B′B) = diag(C′C) = IK, where IK is the identity matrix of order K and diag(∙) denotes the operator that returns the diagonal matrix with the diagonal elements of its argument. If the scaling is not fixed and A, B, and C are fully unconstrained, Wa is superfluous, and the CP model can be equivalently formulated as

$$\mathbf{X}_{\mathrm a}=\mathbf{AI}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a}=\mathbf{A}{(\mathbf{C}\left|\otimes \right| \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a},$$
(8)

where Ia has the same structure as Wa, but its nonzero elements are equal to 1, and the symbol | ⊗| denotes the Khatri-Rao product. The equivalent CP formulations in (7) and (8) may help observe that, differently from T3, the CP solution is unique up to scaling and permuting the components. This implies that the component interpretation is unique. Such a property is highly appreciated in several domains of research.
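The equivalence between (7) and (8) can be checked numerically, e.g., with the small sketch below (hypothetical dimensions; the Khatri-Rao product is implemented as the columnwise Kronecker product).

```python
import numpy as np

def khatri_rao(C, B):
    """Columnwise Kronecker (Khatri-Rao) product of two matrices
    with the same number of columns K."""
    K = C.shape[1]
    return np.column_stack([np.kron(C[:, k], B[:, k]) for k in range(K)])

N, J, H, K = 6, 4, 3, 2
rng = np.random.default_rng(2)
A = rng.normal(size=(N, K))
B = rng.normal(size=(J, K))
C = rng.normal(size=(H, K))

# Superdiagonal identity core unfolded: I_a of order (K x K^2),
# with a 1 linking component k of each mode and zeros elsewhere
I_a = np.zeros((K, K * K))
for k in range(K):
    I_a[k, k * K + k] = 1.0

lhs = A @ I_a @ np.kron(C, B).T      # formulation (7) with W_a = I_a
rhs = A @ khatri_rao(C, B).T         # formulation (8)
print(np.allclose(lhs, rhs))         # True
```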

2.3 The CPclus Model

We can now introduce our proposal. It can be seen as an extension of the CP model in order to simultaneously get a partition of the objects in K clusters and a dimension reduction of variables and subjects through K components. Bearing in mind model (7), we propose the so-called CPclus model that is formalized as

$$\mathbf{X}_{\mathrm a}=\mathbf{UW}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a},$$
(9)

where U of order (N × K) is the binary partition matrix for the objects, B and C are the component matrices for the variables and the subjects, respectively, such that diag(B′B) = diag(C′C) = IK, and Wa is the core supermatrix with non-zero elements denoted by wk (k = 1, …, K). In this way, we fix the scaling of the component matrices so that, differently from T3clus, the CPclus solution is unique (up to cluster/component labeling). As for CP, an equivalent formulation can be derived without fixing the scaling of B and C. In this case, by replacing Wa with Ia, we get

$$\mathbf{X}_{\mathrm a}=\mathbf{UI}_{\mathrm a}{(\mathbf{C}\otimes \mathbf{B})}^{^{\prime}}+\mathbf{E}_{\mathrm a}=\mathbf U{(\mathbf C\left|\otimes \right| \mathbf B)}^{^{\prime}}+\mathbf{E}_{\mathrm a}.$$
(10)

As we shall see, the expression in (10) will be used for estimation purposes. In general, we prefer to consider the CPclus formulation in (9) by assuming that the scaling of B and C is fixed. This allows us to observe that the elements of W represent the overall (across subjects) weights of the clusters. The generic k-th cluster, associated with the weight wk, can be interpreted with respect to the k-th columns of U, B, and C, denoted by uk, bk, and ck, respectively. In particular, elements of uk equal to one identify objects assigned to cluster k; elements of bk and ck provide the importance of the variables and subjects, respectively. In other words, objects assigned to the generic k-th cluster are fitted by the k-th variable and subject components. This highlights a one-to-one relation between cluster k and component k, k = 1, …, K, consistently with the CP model.

To investigate the characteristics underlying CPclus, it is convenient to express the model with respect to matrices Xh, h = 1, …, H. An equivalent formulation of (9) is given by

$$\mathbf{X}_{h} =\mathbf{UWC}_{h}\mathbf{B}^{^{\prime}}+\mathbf{E}_{h},$$
(11)

h = 1, …, H, where W is the diagonal matrix of size K having the nonzero elements of Wa in its main diagonal and Ch is the diagonal matrix of size K having the h-th row of C in its main diagonal, i.e., the component scores pertaining to subject h. Therefore, taking into account model (2), we can conclude that the CPclus centroids are defined by

$$\mathbf{Y}_{h} =\mathbf{WC}_{h}\mathbf{B}^{^{\prime}},$$
(12)

h = 1, …, H. This result can be explained as follows. Apart from W, which is common to all subjects and provides the weights of the object clusters, the centroids for subject h depend on Ch and B, capturing the three-way structure of the data. Matrix B measures the relevance of the variables for the K clusters/components. Since the same B is assumed across subjects, the underlying idea of CPclus is that the slices are described by the same matrices U and B but in different proportions, because B is weighted differently through the subject-specific matrices Ch (h = 1, …, H).
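The following small sketch (hypothetical dimensions and random parameters, written for this exposition) illustrates the slice-wise formulation (11): every slice Xh is fitted by the same partition U, weights W, and loading matrix B, rescaled through the subject-specific diagonal matrix Ch.

```python
import numpy as np

N, J, H, K = 6, 4, 3, 2
rng = np.random.default_rng(3)

U = np.zeros((N, K))
U[np.arange(N), rng.integers(0, K, size=N)] = 1   # binary partition matrix
B = rng.normal(size=(J, K))                       # variable components
C = rng.normal(size=(H, K))                       # subject components
w = rng.uniform(1, 2, size=K)                     # cluster weights (diagonal of W)

W = np.diag(w)
for h in range(H):
    C_h = np.diag(C[h, :])        # component scores of subject h
    Y_h = W @ C_h @ B.T           # centroid matrix for subject h, Eq. (12)
    Xh_model = U @ Y_h            # model part of slice X_h, Eq. (11)
print(Xh_model.shape)             # (6, 4)
```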

The CPclus solution can be found according to the least-squares approach by minimizing the loss function

$$f = {\Vert \mathbf{E}_{\mathrm a}\Vert }^{2} = {\sum }_{h=1}^{H}{\Vert \mathbf{E}_{h}\Vert }^{2},$$
(13)

with respect to U, B, and C, where || ⋅ || denotes the Frobenius norm. No constraints are introduced for B and C, while U is constrained to be a binary matrix with row-wise sums equal to one. The constrained minimization of (13) is carried out through an alternating least-squares algorithm described in Sect. 3.

2.4 Comparison with Related Models

CPclus takes inspiration from T3clus and the so-called Clustering around Latent Variables for three-Way data (CLV3W) model proposed by Wilderjans and Cariou (2016).

Similarly to T3clus, CPclus focuses on between-cluster differences in mean. As already shown, this is done by simultaneously clustering the objects and extracting components for variables and subjects. Differently from T3clus, where the low-dimensional configuration of variables and subjects is achieved through the Tucker3 model, CPclus considers the CP model. Hence, CPclus can be seen as a constrained version of T3clus, where K = Q = R and H is superdiagonal. This has a welcome side effect of simplicity. More specifically, the simplicity of the CPclus solution relies on the fact that the number of object clusters does not depend on the numbers of variable and subject components as it is for T3clus. In fact, Rocci and Vichi (2005, page 733) observe that the number of clusters K can be selected upon choosing the number of components Q (in the application, R is set equal to one). Therefore, it should be clear that the cluster interpretation in CPclus is easier than that in T3clus, where each cluster must be interpreted by considering all variable and subject components and their interactions provided by the elements of the core H. In CPclus, every cluster is associated with its specific component, and, therefore, it is sufficient to interpret the generic k-th cluster by considering the corresponding k-th component, k = 1, …, K.

With regard to CLV3W, we observe that both CPclus and CLV3W assume, for each cluster, one component resulting from the CP model. Nevertheless, the former is interested in between-cluster differences in mean, the latter in between-cluster differences in covariation. Starting from (10), the CLV3W model can be formalized as

$$\mathbf{X}_{\mathrm a}=\mathrm{diag}\left(\mathbf{a}\right) \mathbf U{\left( \mathbf C\left|\otimes \right| \mathbf B\right)}^{^{\prime}}+\mathbf{E}_{\mathrm a},$$
(14)

where diag(a) is the diagonal matrix with main diagonal equal to a. Vector a of length N contains the component scores for the objects, an, n = 1, …, N. Therefore, CLV3W is a particular CP model, with the component matrix for the objects constrained to be A = diag(a)U. In other words, matrix A is such that the generic n-th row contains K – 1 zeros and the element an located at column k, where k is the cluster to which object n belongs. For instance, if N = 6, K = 3, and

$$\mathbf{U}=\left[\begin{array}{ccc}1& 0& 0\\ 1& 0& 0\\ 1& 0& 0\\ 0& 0& 1\\ 0& 1& 0\\ 0& 1& 0\end{array}\right],$$
(15)

we have

$$\mathbf{A}=\left[\begin{array}{ccc}{a}_{1}& 0& 0\\ {a}_{2}& 0& 0\\ {a}_{3}& 0& 0\\ 0& 0& {a}_{4}\\ 0& {a}_{5}& 0\\ 0& {a}_{6}& 0\end{array}\right].$$
(16)

If the elements of a are constrained to be non-negative, we obtain the CLV3W model with Non-Negativity constraints (CLV3W-NN), introduced by Cariou and Wilderjans (2018).

It follows that CLV3W and CLV3W-NN can be seen as three-way extensions of the disjoint principal component analysis (see, e.g., Vichi & Saporta, 2009), where each object has at most one nonzero component score.
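For illustration, the sketch below (using the U of (15) together with a hypothetical score vector a) builds the constrained object component matrix A = diag(a)U of CLV3W; setting a equal to a vector of ones recovers the CPclus case.

```python
import numpy as np

# Partition matrix U from (15) and a hypothetical score vector a
U = np.array([[1, 0, 0],
              [1, 0, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 1, 0]], dtype=float)
a = np.array([0.55, -0.42, 0.51, 0.58, -0.48, 0.53])   # hypothetical scores

A_clv3w = np.diag(a) @ U   # constrained object component matrix of CLV3W, as in (16)
A_cpclus = U               # CPclus corresponds to diag(a) = I_N, so A reduces to U
```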

By comparing (10) and (14), we can observe that CPclus is a constrained version of both CLV3W and CLV3W-NN with diag(a) = IN. In CPclus, the component scores for the objects are either 1 or 0, i.e., within the same cluster the component scores are identical (equal to 1), while CLV3W and CLV3W-NN allow such component scores to differ. Such a difference has a relevant impact on the conclusions that can be drawn from the two models, i.e., on whether the research interest lies in between-cluster differences in mean or in covariation. CPclus resembles K-means in that they both use exactly the same model for all the objects assigned to the same cluster. In fact, the CPclus model part (C |⊗| B)′ plays the same role as the centroids do in K-means. Thus, CPclus indirectly characterizes the means of the clusters in terms of the rows of (C |⊗| B)′. This is not the case for CLV3W and CLV3W-NN due to the presence of the weight vector a, which actually models the objects assigned to the same cluster differently.

To fully understand the distinction between cluster differences in mean and in covariation, we consider three simulated datasets with N = 100, J = 2, and H = 2. Note that details on the data generation and on the model solutions we are going to discuss are given in the Supplementary Material.

In the first dataset, displayed in Fig. 1, two clusters are clearly visible. Cluster 1 is composed of objects denoted by empty symbols. They have small values of variable 1 and large values of variable 2 for subject 1, while for subject 2 the opposite holds. Conversely, filled symbols refer to objects belonging to cluster 2. Such objects are characterized by large values of variable 1 and small values of variable 2 for subject 1, while for subject 2 the opposite holds. The differences between the two clusters are more pronounced for subject 1 than for subject 2. Such differences are in mean and do not coincide with differences in covariation.

Fig. 1 Simulated data (N = 100, J = 2, and K = 2) from CLV3W. Empty and filled symbols denote CPclus, CLV3W-NN, MVNM, and PMVNM solutions. Squares and circles denote the CLV3W solution

By applying CPclus with K = 2, the clusters described above are fully identified. This is not the case when applying CLV3W with K = 2 (using the function CLV3W_kmeans, with the default options, of the R package ClustVarLV, Vigneau et al., 2022). Specifically, cluster 1 is formed by the (empty and filled) squares and cluster 2 by the (empty and filled) circles. Partitions from CLV3W and CPclus have a different meaning; in fact, the CLV3W clusters correctly distinguish the objects in terms of covariation. Conversely, by applying CLV3W-NN with K = 2 (specifying the option NN = TRUE in CLV3W_kmeans), we get the same partition as CPclus. Although some elements of a are negative, they are estimated as positive values.

To clarify, we note that the data are generated from the CLV3W model with K = 2 and 5% of noise added. Bearing in mind (14), matrices B and C have elements randomly generated from the standard normal distribution, and U contains ones in the first column for the first N/2 = 50 rows and in the second column for the last N/2 = 50 rows. As for vector a, the first N/2 elements refer to the objects in cluster 1: one-half are randomly generated from the uniform distribution in [0.5, 0.6] and the other half from the uniform distribution in [– 0.5, – 0.4]. The last N/2 elements of a, related to the objects in cluster 2, are generated in the same way.

CLV3W fits two different CP models with one component (one for each cluster) because it correctly recognizes that two different component structures exist. In other words, CLV3W searches for differences in covariation. Such differences do not necessarily lead to the same differences in mean revealed by CPclus. By requiring non-negativity constraints in CLV3W, i.e., by fitting CLV3W-NN to the data, the estimates of a are non-negative, implying that objects within each cluster are related with the corresponding component in terms of positive covariance.

In this first dataset, the component scores in a differ only within clusters but not between clusters in the sense that, for each cluster, the same random generation process is followed.

We now consider the data displayed in Fig. 2. They are generated by using the same matrices B, C, and U, and the same amount of noise is added, but now the first N/2 elements of a referring to the objects in cluster 1 are randomly generated from the uniform distribution in [– 0.5, – 0.4], and the last N/2 elements referring to the objects in cluster 2 are drawn from the uniform distribution in [0.5, 0.6]. Thus, the component scores in a differ between clusters but not within clusters.

Fig. 2 Simulated data (N = 100, J = 2, and K = 2) from CLV3W. Empty and filled symbols denote CPclus, CLV3W, MVNM, and PMVNM solutions. Squares and circles denote the CLV3W-NN solution

The two clusters are correctly identified by CPclus and CLV3W; therefore, the groups are characterized by between-cluster differences in both mean and covariation. Nevertheless, in this case, CLV3W-NN suffers from the presence of all-negative component scores in one cluster. In fact, the negative scores of a pertaining to the objects in cluster 1 are estimated as zeros and, in general, the cluster assignments appear to be essentially random.

We conclude the comparisons by considering the data displayed in Fig. 3. Once again, they are generated by using the same matrices B, C, and U with the same amount of noise added, but now a contains the same positive values, which, without loss of generality, can be all ones. Therefore, in this case, the CPclus model, or equivalently, the CLV3W or CLV3W-NN model with diag(a) = IN, is used to generate the data.

Fig. 3 Simulated data (N = 100, J = 2, and K = 2) from CPclus, or equivalently from CLV3W or CLV3W-NN with diag(a) = IN. Empty and filled symbols denote CPclus, CLV3W, CLV3W-NN, MVNM, and PMVNM solutions

The plot for subject 1 is more cluttered than the one for subject 2. Nevertheless, two clusters are visible, composed of empty and filled circles, respectively.

By applying CPclus, CLV3W, and CLV3W-NN with K = 2 clusters, we see that all models correctly recover the clusters because, in this case, the partition is formed by groups characterized by between-cluster differences in both mean and covariation and the component scores in a are non-negative. Note also that preliminary analyses we made (details not reported) show that CPclus, CLV3W, and CLV3W-NN produce the same partition when, under the above-described data generation process, the elements of a are positive, but are allowed to vary.

Summing up, we can state that CPclus, on the one side, and CLV3W and CLV3W-NN, on the other side, achieve different objectives, even though the former can be seen as a constrained version of the latter. Generally speaking, clustering is usually intended to account for differences in mean, and if this is the case, the CPclus model should be preferred. Conversely, if one is interested in discovering clusterwise component structures, i.e., differences in covariation, CLV3W should be adopted. The non-negativity constraints imply that CLV3W-NN might be positioned in between CPclus and CLV3W. Finally, it might occur that the partitions based on between-cluster differences in mean and in covariation coincide: in this particular case, CPclus, CLV3W, and CLV3W-NN can equivalently be considered and are expected to give the same results, although CPclus is more parsimonious than CLV3W and CLV3W-NN because the N parameters in vector a are not estimated.

We are going to further investigate and compare the performances of CPclus, CLV3W, and CLV3W-NN in the applications on synthetic and real data in Sects. 4 and 5.

For the sake of completeness, we analyze the datasets of Figs. 1, 2, and 3 by the model-based proposals by Viroli (2011) and Sarkar et al. (2020), implemented in the functions MatrixMixt and MatTrans.EM of the R packages MatrixMixtures (Tomarchio et al., 2021) and MatTransMix (Zhu et al., 2022), respectively. The model by Viroli (2011), henceforth denoted by MVNM (Matrix Variate Normal Mixture), assumes a mixture of matrix variate Normal distributions. In this framework, objects are supposed to come from a set of K components (or groups) of unknown proportions. For each object, a data matrix where rows and columns correspond to variables and subjects, respectively, is observed. Each component depends on the mean matrix (where rows and columns correspond to variables and subjects), the covariance matrix for the variables (containing the variances and covariances of the variables), and the covariance matrix for the subjects (containing the variances and covariances of the subjects). The aim of MVNM is to allocate the random sample of objects to the component from which they are assumed to come, under the assumption that each component corresponds to a group. This is done by computing the maximum a posteriori probabilities of group membership.

The same model is assumed by Sarkar et al. (2020) where parsimonious representations of the mean matrices and the covariance matrices are used to avoid overparameterization problems due to the high number of model parameters. The parsimonious structure of the mean matrices is carried out by considering an additive model in place of the general (unconstrained) one. The parsimonious representations of the covariance matrices are obtained from their spectral decompositions. See, for further details, Sarkar et al. (2020). Henceforth, we refer to this model as PMVNM (Parsimonious Matrix Variate Normal Mixture).

By considering a mixture of K = 2 matrix variate Normal distributions, we apply MVNM and PMVNM to the three artificial datasets by using the default options. For all datasets, according to the maximum a posteriori probability estimates, the two models discover the same partitions obtained by CPclus. Hence, we can state that the model-based proposals do not reveal between-cluster differences in covariation but differences in mean. Obviously, they are also able to uncover not only between-cluster differences in mean but also in (row and/or column) covariance. The applicability of the model-based approaches might be prevented whenever the numbers of variables and subjects are relatively large with respect to the number of objects. This is especially true for standard models such as MVNM, but also for the parsimonious versions such as PMVNM. In fact, apart from some simple PMVNM representations with fewer parameters, the more flexible and complex ones require a larger number of free parameters. Just to give an example, by considering the general model for the mean matrices and regardless of the possibly parsimonious representations of the covariance matrices, the number of objects must be larger than the number of free parameters of the group-specific mean matrices (N > KJH). In the additive model, it must be N > K(J + H – 1).

Finally, we observe that when H = 1, the three-way data array X reduces to a standard two-way matrix, and CPclus reduces to the standard K-means algorithm. This can be explained by noting that, when H = 1, C is no longer a matrix but a row-vector of ones of length K, and, therefore, C |⊗| B = B, which represents the centroid matrix.

2.5 CPclus Model Selection

When the CPclus model is fitted to data where the true number of clusters K is unknown, the number of clusters needs to be specified. As usual in clustering problems, in addition to interpretability reasons, one is advised to vary the number of clusters K* in an appropriate interval of values, inspect the values of the loss function to get some insight into the effect of such choices on the model fit, and further base one’s choice on the substantive interpretability of the solutions. In practice, looking at the scree plot of the loss values against increasing values of K*, the presence of an elbow may indicate the optimal choice of K*, beyond which the loss does not display a sizeable decrease.

Such a criterion can be formalized by computing the Scree-Ratio index (Wilderjans & Ceulemans, 2013):

$$\mathrm{SR}=\frac{f_{K^{*}}-f_{K^{*}-1}}{f_{K^{*}+1}-f_{K^{*}}},$$
(17)

where \({f}_{{K}^{*}}\) denotes the loss function value according to (13) computed with K* clusters. The optimal number of clusters is determined as the K* for which SR is maximal over an appropriate range of K* values.
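As a minimal sketch, the SR index in (17) can be computed from the loss values obtained over a range of K* values as follows (the loss values below are purely illustrative).

```python
import numpy as np

# Illustrative loss values f_{K*} for K* = 2, ..., 6 (hypothetical numbers)
K_values = np.array([2, 3, 4, 5, 6])
f = np.array([120.0, 70.0, 62.0, 58.0, 55.0])

# SR can only be computed for the interior values K* = 3, 4, 5
SR = (f[1:-1] - f[:-2]) / (f[2:] - f[1:-1])
best_K = K_values[1:-1][np.argmax(SR)]
print(dict(zip(K_values[1:-1], np.round(SR, 2))), best_K)
```

With these illustrative values, the largest ratio occurs at K* = 3, which would therefore be retained as the optimal number of clusters.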

As we shall see, the performance of SR has been assessed in the simulation study of Sect. 4, where it has achieved a good performance in recovering the underlying (known) clustering structure. Hence, the SR index has also been used in the applications.

In any case, as always in model selection, the final decision regarding the model to be retained should also be based on the interpretability of the solution.

3 The Algorithm

In order to shed light on the model formulation and to ease its estimation, different formulations are considered here. In (10) the CPclus model is formulated in terms of the supermatrix Xa with the object-by-variable matrices pertaining to the various subjects next to each other. Rows of Xa refer to objects, and its columns refer to all combinations of variables and subjects, with variables nested within subjects. Since in three-way analysis the role of objects, variables, and subjects can be switched, at least two other supermatrices, denoted by Xb and Xc, can be defined. Specifically, Xb of order (J × NH) has rows referring to the variables and columns referring to the combinations of objects and subjects, with the latter nested within the former. Finally, Xc has the subjects in the rows and the combinations of variables and objects in the columns (with the latter nested within the former). Thus, Xc has order (H × JN). The equivalent formulations of CPclus for Xb and Xc are

$$\mathbf{X}_{\mathrm b}=\mathbf{BW}_{\mathrm b}{(\mathbf U\otimes \mathbf C)}^{^{\prime}}+{\mathbf E}_{\mathrm b}$$
(18)

and

$$\mathbf{X}_{\mathrm c}=\mathbf{CW}_{\mathrm c}{(\mathbf B\otimes \mathbf U)}^{^{\prime}}+\mathbf{E}_{\mathrm c},$$
(19)

respectively, where, with obvious notation, Wb and Wc denote the alternative matrix unfoldings of the three-way array W. For estimation purposes, the scaling of B and C need not be fixed (this can be done upon convergence of the algorithm we are going to derive). To this end, it is convenient to rewrite (18) and (19) without explicitly using Wb and Wc. Then we get

$$\mathbf{X}_{\mathrm b}=\mathbf{BI}_{\mathrm a}{(\mathbf U\otimes \mathbf C)}^{^{\prime}}+\mathbf{E}_{\mathrm b}=\mathbf B{(\mathbf U\left|\otimes \right| \mathbf C)}^{^{\prime}}+\mathbf{E}_{\mathrm b},$$
(20)

and

$$\mathbf{X}_{\mathrm c}=\mathbf{CI}_{\mathrm a}{(\mathbf B\otimes \mathbf U)}^{^{\prime}}+\mathbf{E}_{\mathrm c}=\mathbf C{(\mathbf B\left|\otimes \right| \mathbf U)}^{^{\prime}}+\mathbf{E}_{\mathrm c},$$
(21)

respectively.

The CPclus model is fitted and estimated in a least-squares framework by solving the following constrained minimization problem:

$$\begin{array}{ll}\mathrm{min} & {\Vert \mathbf{X}_{\mathrm a}-\mathbf U{(\mathbf C\left|\otimes \right| \mathbf B)}^{^{\prime}}\Vert }^{2},\\ \mathrm{s.t.} & {u}_{nk}\in \left\{0,1\right\}\quad (n=1,\dots ,N;\ k=1,\dots ,K),\\ & \sum_{k=1}^{K}{u}_{nk}=1\quad (n=1,\dots ,N),\end{array}$$
(22)

An alternating least squares (ALS) algorithm has been implemented, which consists of the following steps.

  • Step 0: Initialization.

  • Starting admissible solutions for parameters are chosen randomly or in a rational way.

  • Step 1: Updating component matrix for the variables B.

  • Given the current estimates of U and C, we can consider the CPclus model in formula (20). The minimization problem can then be rewritten as

    $$\mathrm{min}{\Vert \mathbf{X}_{\mathrm b}-\mathbf B{\left(\mathbf U\left|\otimes \right| \mathbf C\right)}^{^{\prime}}\Vert }^{2}={\Vert \mathbf{X}_{\mathrm b}-\mathbf{B}{\mathbf{F}_{\mathrm b}}^{^{\prime}}\Vert }^{2},$$
    (23)
  • with respect to B, where Fb = (U | ⊗| C). This is an unconstrained regression problem with solution

    $$\mathbf{B} = \mathbf{X}_{\mathrm b}\mathbf{F}_{\mathrm b}{(\mathbf{F}_{\mathrm b}\mathrm{^{\prime}}\mathbf{F}_{\mathrm b})}^{-1}.$$
    (24)
  • Step 2: Updating component matrix for the subjects C.

  • Given the current estimates of U and B, the update of C is similar to that of B, provided that formula (21) is considered in place of (20). We have

    $$\mathrm {min}{\Vert \mathbf{X}_{\mathrm c}-\mathbf C{\left(\mathbf B\left|\otimes \right| \mathbf U\right)}^{^{\prime}}\Vert }^{2}={\Vert \mathbf{X}_{\mathrm c}-\mathbf{C}{\mathbf{F}_{\mathrm c}}^{^{\prime}}\Vert }^{2},$$
    (25)
  • where Fc = (B |⊗| U). The solution of the unconstrained minimization of (25) with respect to C is given by

    $$\mathbf{C} = \mathbf{X}_{\mathrm c}\mathbf{F}_{\mathrm c} {(\mathbf{F}_{\mathrm c}\mathrm{^{\prime}}\mathbf{F}_{\mathrm c})}^{-1}.$$
    (26)
  • Step 3: Updating membership matrix for the objects U.

  • Starting from (10) and given the current estimates of B and C, updating U amounts to performing the allocation step of the standard K-means algorithm (MacQueen, 1967) on matrix Xa with the fixed centroid matrix Fa′ = (C |⊗| B)′: each row of Xa is assigned to the closest centroid (row of Fa′), which provides the optimal membership matrix U.

  • Step 4: Stopping rule.

  • The loss value f(U, B, C) is computed for the current estimates. When the updated estimates have decreased the function value considerably (i.e., by more than an arbitrarily small convergence tolerance), all parameters are updated once more according to Steps 1 through 3. Otherwise, the process is considered to have converged.

At each step, the algorithm does not increase and generally decreases the loss function; since f is bounded from below, the algorithm converges to a point that can be expected to be at least a local minimum. To increase the chance of finding the global minimum, the algorithm should be run several times with different starting estimates, retaining the best solution in terms of the minimum loss function value.

Given the unconstrained estimates of B and C, it remains to constrain their columns to unit sums of squares and to compute the weights in Wa. This can be easily done by replacing B and C with B diag(B′B)–1/2 and C diag(C′C)–1/2, respectively, and computing the elements of Wa different from 0 as the diagonal elements of diag(B′B)1/2 diag(C′C)1/2. As already observed, such calculations do not alter the fit of the solution.
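The sketch below is a compact NumPy implementation of Steps 0–4, written for this exposition (it is not the authors' MATLAB routine); the function and parameter names (cpclus, unfold, n_init, and so on) are ours, and, for brevity, empty clusters are not explicitly handled. It alternates the regression updates (24) and (26) with the K-means allocation step, stops when the loss (13) no longer decreases appreciably, and finally rescales B and C as described above.

```python
import numpy as np

def khatri_rao(P, Q):
    """Columnwise Kronecker (Khatri-Rao) product of matrices with K columns each."""
    return np.column_stack([np.kron(P[:, k], Q[:, k]) for k in range(P.shape[1])])

def unfold(X):
    """Supermatrices X_a, X_b, X_c of an (N x J x H) array X, with the
    column orderings implied by (10), (20), and (21)."""
    N, J, H = X.shape
    Xa = np.concatenate([X[:, :, h] for h in range(H)], axis=1)     # N x JH
    Xb = np.concatenate([X[n, :, :] for n in range(N)], axis=1)     # J x NH
    Xc = np.concatenate([X[:, j, :].T for j in range(J)], axis=1)   # H x JN
    return Xa, Xb, Xc

def cpclus(X, K, n_init=10, max_iter=200, tol=1e-8, seed=0):
    """ALS estimation of the CPclus model (illustrative sketch)."""
    N, J, H = X.shape
    Xa, Xb, Xc = unfold(X)
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        # Step 0: random starting partition (every cluster non-empty) and random C
        labels = rng.integers(0, K, size=N)
        labels[rng.permutation(N)[:K]] = np.arange(K)
        U = np.eye(K)[labels]
        C = rng.normal(size=(H, K))
        f_old = np.inf
        for _ in range(max_iter):
            # Step 1: update B by regressing X_b on F_b = U |x| C, Eq. (24)
            Fb = khatri_rao(U, C)
            B = np.linalg.lstsq(Fb, Xb.T, rcond=None)[0].T
            # Step 2: update C by regressing X_c on F_c = B |x| U, Eq. (26)
            Fc = khatri_rao(B, U)
            C = np.linalg.lstsq(Fc, Xc.T, rcond=None)[0].T
            # Step 3: update U by the K-means allocation step with fixed
            # centroids given by the rows of F_a' = (C |x| B)'
            centroids = khatri_rao(C, B).T                          # K x JH
            d = ((Xa[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            U = np.eye(K)[np.argmin(d, axis=1)]
            # Step 4: stopping rule on the loss function (13)
            f = np.linalg.norm(Xa - U @ centroids) ** 2
            if f_old - f < tol:
                break
            f_old = f
        if best is None or f < best[0]:
            best = (f, U, B, C)
    f, U, B, C = best
    # Rescale B and C to unit column sums of squares; weights w_k = ||b_k|| * ||c_k||
    sb, sc = np.sqrt((B ** 2).sum(axis=0)), np.sqrt((C ** 2).sum(axis=0))
    return U, B / sb, C / sc, sb * sc, f
```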

A MATLAB routine, available upon request, implements the CPclus algorithm.

4 Simulation Study

In order to evaluate the performance of the CPclus model and the algorithm proposed in Sect. 3, an extensive simulation study has been carried out on artificially generated data. A comparison with CLV3W and CLV3W-NN, its closest least-squares-based competitors, has also been done.

4.1 Simulation Design and Measures of Performance

A number of datasets have been generated by the true underlying model (10) by setting N = 40, 100 (objects), J = 8, 15 (variables), H = 8, 15, 30 (subjects), and K = 3, 4, 5 (groups). Specifically, the numbers of variables and subjects (J × H) were combined to form five cells: (J = 8, H = 8), (J = 8, H = 30), (J = 15, H = 8), (J = 15, H = 15), and (J = 15, H = 30).

Clusters (and the corresponding membership matrices U) have been randomly generated with approximately equal sizes, checking for non-emptiness. In addition, two different structures (BCS) have been set for the component matrices B and C:

BCS1: matrices B and C have been randomly generated from the uniform distribution in [–1, 1];

BCS2: the first columns b1 and c1 of matrices B and C, respectively, have been randomly generated from the uniform distribution in [– 1, 1]; the remaining columns bk and ck, k = 2, …, K, have been built as

$$\mathbf{b}_{k} = \mathbf{b}_{1} + d\,\mathbf{z}_{\mathbf{B}k},$$
(27)
$$\mathbf{c}_{k} = \mathbf{c}_{1} + d\,\mathbf{z}_{\mathbf{C}k},$$
(28)

where zBk and zCk are vectors of length J and H, respectively, with elements randomly drawn from the standard normal distribution, and d is a scalar tuning the level of similarity between the columns of B and C. After preliminary analyses, in the simulation study, we have set d = 0.4. This choice has led to centroids close to each other, making structure BCS2 more complex than BCS1, where instead centroids are more separated.

Finally, each data matrix Xa has been built as

$$\mathbf{X}_{\mathrm a}=\mathbf U{(\mathbf C\left|\otimes \right| \mathbf B)}^{^{\prime}}+{\delta \mathbf{E}}_{\mathrm a},$$
(29)

where matrix Ea is a random noise matrix and δ allows the level of noise to be controlled. Each matrix Ea has been drawn from the standard normal distribution and rescaled so that it has the same sum of squares as the error-free data. The values of δ have been chosen equal to 0.75, 1, 1.5, and 2, which correspond to adding (100⋅δ)% of noise to the generated data and to setting cases with low, medium, high, and very high error, respectively.
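A minimal sketch of the data-generation scheme in (29) for one cell of the design (BCS1 setting; the random seed is arbitrary and the non-emptiness check on U is omitted) is given below.

```python
import numpy as np

def khatri_rao(C, B):
    return np.column_stack([np.kron(C[:, k], B[:, k]) for k in range(C.shape[1])])

# One cell of the design; delta tunes the noise level
N, J, H, K, delta = 40, 8, 8, 3, 0.75
rng = np.random.default_rng(4)

B = rng.uniform(-1, 1, size=(J, K))             # BCS1: uniform on [-1, 1]
C = rng.uniform(-1, 1, size=(H, K))
U = np.eye(K)[rng.integers(0, K, size=N)]       # random clusters (non-emptiness check omitted)

model = U @ khatri_rao(C, B).T                  # error-free data
E = rng.normal(size=model.shape)
E *= np.linalg.norm(model) / np.linalg.norm(E)  # same sum of squares as the error-free data
X_a = model + delta * E                         # Eq. (29)
```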

For each combination of the levels of the six controlled design factors (N, J, H, K, BCS, and δ), i.e., cell of the design, 100 data sets Xa were generated. Thus, by crossing the design factors, 2 (numbers of objects) × 5 (pairs of numbers of variables and subjects) × 3 (numbers of clusters) × 2 (B and C structures) × 4 (noise levels) × 100 (replications) = 24,000 different data arrays were generated.

In order to evaluate the performance of the algorithm in recovering the true clustering structure, the CPclus model has been fitted to each data array by varying the number of (nonempty) clusters K* = 2, 3, 4, 5, 6, regardless of the true number K. CLV3W and CLV3W-NN, instead, were run only with the true number of clusters (K* = K).

To prevent falling into local optima, for each dataset, the best solution in terms of loss has been retained from 100 different starts, either random or rational. Specifically, for 99 runs, the starting membership matrices U were randomly generated, while in one run the rational starting U was derived from the component matrix A of the unconstrained CP solution by setting the maximum value per row to one and all the rest to zero. Thus, the algorithm was run 2,400,000 times in total.

The simulation study aimed to analyze the performance of CPclus from different standpoints: goodness of recovery, also in comparison with CLV3W and CLV3W-NN, sensitivity to local minima, and scalability. In addition, in view of the applications, also the performance of the SR index in selecting the correct model has been investigated.

Goodness of Recovery

The recovery performance is analyzed to investigate to what extent the algorithm is able to correctly discover both the partition of objects and the component structures for variables and subjects.

The Adjusted Rand Index (Hubert & Arabie, 1985) between true and fitted partitions when K* = K is computed, which takes its maximum value equal to 1 when the two partitions are coincident; in addition, the percentage of successes in recovering exactly the true partitions is also reported.

As for the recovery of the component matrices B and C, Tucker’s congruence coefficient (Tucker, 1951) is used between true and fitted matrices after considering all cluster permutations and selecting the one that attains the largest value of the coefficient. The coefficient takes values in [0,1], where 1 indicates a perfect recovery.
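A minimal sketch (the function name is ours) of how Tucker's congruence coefficient between a true and a fitted component matrix can be computed over all column permutations is given below; here the columnwise congruences are averaged and the absolute value is taken so that the coefficient lies in [0, 1], which is an assumption about the exact computation used in the study.

```python
import numpy as np
from itertools import permutations

def tucker_congruence(M_true, M_fit):
    """Mean columnwise congruence between two component matrices,
    maximized over all column (cluster) permutations."""
    K = M_true.shape[1]
    def phi(x, y):
        # congruence of two vectors; the absolute value keeps it in [0, 1]
        return abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    best = 0.0
    for perm in permutations(range(K)):
        val = np.mean([phi(M_true[:, k], M_fit[:, p]) for k, p in enumerate(perm)])
        best = max(best, val)
    return best
```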

Sensitivity to Local Minima

In order to analyze how the algorithm is sensitive to falling into local solutions, for each dataset, a guess of the (unknown) “global optimum” is derived as the best solution out of 100 optimal values from the 100 starts, even if in principle such a best solution may be a local optimum itself. Thus, the percentage of solutions where the loss value is less than 0.2% higher than the “global optimum” is computed.

Moreover, the capacity of the rational start to reach the “global optimum” is analyzed by the percentage of times the rational start leads to such an optimal solution.

Scalability

The computation time per run is reported to study the scalability of the algorithm at different experimental conditions. In order to study how the choice of the starting partitions may affect the computation time, both the times per run for the two different starts (random or rational) are reported. The simulation study was run on a personal computer with Intel(R) Core i7-7700 CPU 3.60 GHz processor and 16 GB of RAM.

Model Selection

In order to select the optimal number of clusters/components, CPclus was run with an increasing number of groups K* (i.e., K* = 2, …, 6) regardless of the true number K of clusters generated. The SR index is given in (17), and its percentage of successes in selecting the correct model is computed to assess the validity and utility in practical applications when the true number of clusters/components is unknown.

In summary, for each cell of the experimental design, the following measures have been computed by averaging over the 100 datasets generated:

  • ARI: Adjusted Rand Index between true and fitted partitions when K* = K;

  • %(ARI = 1): percentage of successes in recovering the true partitions, i.e., percentage of times where ARI = 1 when K* = K;

  • TCC(B) and TCC(C): Tucker’s congruence coefficients between true and fitted matrices B and C, respectively;

  • %GOpt: percentage of solutions with loss value less than 0.2% higher than the “global optimum”;

  • RndTime: mean computation time per run (in seconds) from random starts;

  • RatTime: computation time per run (in seconds) from the rational start;

  • %Rtn: percentage of times the rational start leads to the “global optimum”;

  • %SR: percentage of times the Scree-Ratio index indicates the correct number of clusters;

  • SR3, SR4, SR5: Scree-Ratio indices for K* = 3, 4, 5 clusters, respectively.

4.2 Simulation Results (CPclus)

The results of the simulation study are reported and displayed separately for the two different structures (BCS) of the component matrices B and C in Tables 1, 2, 3, and 4. For the sake of brevity, the average TCC(B) and TCC(C) are not reported because they are always very close to 1 with a very limited variation (Table 5), so the degree of congruence between true and fitted component matrices is always optimal (Lorenzo-Seva & ten Berge, 2006).

Table 1 Average values of ARI, %(ARI = 1), SR3, SR4, SR5 and %SR distinguished by N, J, H, K and level of error (BCS1 setting, CPclus).
Table 2 Average values of ARI, %(ARI = 1), SR3, SR4, SR5 and %SR distinguished by N, J, H, K and level of error (BCS2 setting, CPclus).
Table 3 Average values of RndTime, RatTime, %GOpt and %Rtn distinguished by N, J, H, K and level of error (BCS1 setting, CPclus).
Table 4 Average values of RndTime, RatTime, %GOpt and %Rtn distinguished by N, J, H, K and level of error (BCS2 setting, CPclus).
Table 5 Descriptive statistics of the Tucker’s congruence coefficients for the recovery of B and C distinguished by BCS (CPclus)

Firstly, in the BCS1 setting (Table 1), the results for the cases with low and medium levels of noise exhibit a perfect recovery of the true partition in terms of ARI and %(ARI = 1) for all cells of the experimental design.

Generally speaking, by analyzing Tables 1 and 2, one can see that the cluster recovery, although high, decreases when (a) the amount of noise increases, as expected, and (b) the true number of clusters increases, with a slightly better performance in the case of the larger sample size. In the BCS1 setting, where a better separation between the centroids is ensured than in BCS2, the average ARI and the percentage of times CPclus succeeds in perfectly identifying the correct partition remain very high even for high and very high levels of noise, with a slight decline only for the case with small information (J = 8, H = 8).

The situation deteriorates in the BCS2 setting, where the lesser separation between centroids actually increases the level of error in the data: in Table 2, it can be observed that the average ARI is lower than in the corresponding cells of Table 1 and the %(ARI = 1) drops dramatically in the case of high levels of noise. However, in the BCS2 case, it can be observed that (a) given the number of variables J, the performance improves as the number of subjects H increases, and (b) given H, the performance improves as J increases, but to a lesser extent.

The percentage of hitting the estimated “global optimum” (%GOpt) is generally high in the BCS1 setting (Table 3), which denotes the capability of CPclus of finding the global optimum, and generally decreases as the number of clusters and the level of noise increase. In the BCS2 setting, where the centroids are less distinct, the same trend can be observed, but percentages %GOpt are generally lower than in BCS1, with values that fall down for high levels of noise. In both settings, the search for the global optimum best benefits from a larger sample, while for small samples, %GOpt slightly increases when J and H are higher.

As for the use of the rational start, %Rtn is either almost always at its maximum or maintains very high values when the amount of noise is not large (Table 4). The chance of finding the optimal solution when starting from the rational partition drops dramatically as the level of noise increases, especially for small samples and when the data information in terms of numbers of variables J and subjects H is small, which suggests the use of a high number of random starts in such conditions.

All in all, the use of the rational start seems to provide substantial improvements in terms of greater chances of reaching the optimal solution. This finding is supported by the fact that %Rtn is generally higher than %GOpt across all experimental conditions, so we may conclude that the use of the rational start reduces the risk of hitting local minima. Given the cell, the computation time when using the rational start is usually longer than with a random start. This can be explained by taking into account that the reported computation time also includes the time required to obtain the unconstrained CP solution upon which the rational start is based.

Further details on computation time emerge by inspecting Tables 3 and 4. We can see that the mean computation time per run increases with the complexity of the data (i.e., number of objects, variables, subjects, and clusters). It can be observed that the divergence between small and large sample sizes narrows as the data are more complex (and probably more informative) in terms of number of variables, subjects, and clusters for all noise levels, with some exceptions when data are very highly perturbed.

Moreover, the increase in computation time for higher numbers of clusters is more pronounced when the data contain large amounts of noise, as expected, while the use of the rational start does not seem to provide substantial differences in computation time.

As for model selection, Tables 1 and 2 display a generally excellent performance of the SR index in all conditions, particularly under the BCS1 setting (Table 1), where the percentages of successes in discovering the correct number of clusters are practically perfect. In the more “confused” BCS2 setting (Table 2), the recovery generally remains good in several conditions, but it deteriorates when the amount of noise is very high, especially when the sample size is smaller. Moreover, the recovery seems to be affected by the number of clusters.

Interestingly, in both settings, the average SR indices, as the current number of clusters varies (SR3, SR4, and SR5), present the maximum average values at the correct number of clusters K, but the magnitudes of such maxima are more pronounced when the level of noise is small and decrease drastically as the error increases. In practical applications, this suggests that a clear difference between the magnitude of the maximum value of the SR index as the number of clusters varies indicates a situation with clear partition and little noise.

All in all, even if the combination of added noise and reduced separation of the centroids deteriorates the clustering structure, anyway the SR index still performs well in selecting the correct model.

4.3 Simulation Results (CLV3W and CLV3W-NN)

CLV3W and its constrained version have been fitted to the same datasets generated for analyzing CPclus, and their performances in terms of ARI and %(ARI = 1) are reported in Tables 6 and 7, respectively. Firstly, we may observe that the two models give essentially the same performance, with no significant differences in either setting or in any experimental condition, so the comparison with CPclus is analogous for both. The recovery trends are similar to those described above for CPclus. Specifically, the cluster recovery gets worse when the amount of noise increases, when the true number of clusters increases, and when switching from the BCS1 to the BCS2 setting. The average values of ARI and %(ARI = 1) are rather high for most of the cells of the experimental design: the main exceptions refer to the BCS2 setting for high and very high levels of noise and the case with small information (J = 8, H = 8).

Table 6 Average values of ARI and %(ARI = 1), distinguished by N, J, H, K, level of error and BCS setting (CLV3W)
Table 7 Average values of ARI and %(ARI = 1), distinguished by N, J, H, K, level of error and BCS setting (CLV3W-NN)

By comparing the performances of CLV3W/CLV3W-NN and CPclus, it can be observed that the latter model outperforms the former ones. In fact, for every cell, the average values of ARI and %(ARI = 1) for CPclus are never lower than the corresponding ones for CLV3W and CLV3W-NN. In some cases, the average values coincide or are almost the same, and this often occurs in conditions where (essentially) perfect recoveries are obtained. In some other cases, the differences in the average values are much more pronounced, and this holds when the level of noise is high or very high and, above all, when J and H decrease.

All in all, different partitions are discovered by fitting CLV3W, CLV3W-NN, and CPclus, consistently with the different aims of the models. Focusing on between-cluster differences in covariation may occasionally lead to the same partition as focusing on between-cluster differences in mean, but in general the partitions differ.

For the sake of completeness, we also report the average values of TCC(B) and TCC(C), which are (0.9060; 0.9541) for CLV3W and (0.9052; 0.9533) for CLV3W-NN, respectively. They denote very high degrees of congruence between true and fitted component matrices, even if lower than those for CPclus.
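As a reference for how such congruence values can be obtained, the sketch below computes a columnwise Tucker congruence coefficient between a true and a fitted component matrix and averages it over columns; for simplicity it assumes the columns have already been matched, whereas a full evaluation would first resolve permutation and sign indeterminacies.

```python
# Sketch of Tucker's congruence coefficient between true and fitted component
# matrices (e.g., TCC(B) or TCC(C)), with columns assumed to be matched.
import numpy as np

def tucker_congruence(true_F, est_F):
    """true_F, est_F: component matrices of the same size with matched columns."""
    num = np.sum(true_F * est_F, axis=0)
    den = np.sqrt(np.sum(true_F**2, axis=0) * np.sum(est_F**2, axis=0))
    return np.mean(np.abs(num / den))     # mean absolute congruence over columns
```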

5 Applications

In this section, we report applications of the proposed CPclus model and of some of its least-squares and model-based competitors, namely CLV3W, CLV3W-NN, T3clus, MVNM, and PMVNM, to two well-known real data sets.

5.1 TV Data

CPclus is applied to the so-called TV data (Lundy et al., 1989), consisting of ratings of N = 15 American TV programs (see Table 8) on J = 16 bipolar scales, given by 40 subjects (university students) in 1981. Because of missing ratings, some students have been removed, and the analysis pertains to H = 30 students. Before applying CPclus, the variables have been centered across the TV programs, in line with Rocci and Vichi (2005), who fit T3clus to the same data. Therefore, this application offers a comparative assessment of the CPclus and T3clus results. CLV3W and CLV3W-NN are also applied to the TV data, whereas neither MVNM nor PMVNM can be used because N < min(J, H).
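For completeness, a minimal sketch of this preprocessing step is given below, assuming the data are stored in an objects × variables × subjects array and that the centering is performed across the TV programs separately for every variable–subject combination; the function name is illustrative.

```python
# Sketch of the preprocessing applied to the TV data: centering across the
# objects (TV programs) for every variable-subject combination.
import numpy as np

def center_across_objects(X):
    """X: (N x J x H) array; returns X with zero mean over the object mode."""
    return X - X.mean(axis=0, keepdims=True)
```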

Table 8 TV data. Partition matrix U for the TV programs (CPclus with K = 5)

Note that Rocci and Vichi (2005) impute the missing data by using the T3 model before applying T3clus. However, such an imputation does not seem to substantially alter the results. Rocci and Vichi (2005) consider as optimal the solution with K = 5 clusters for the objects, Q = 2 components for the variables, and R = 1 component for the subjects. The percentage of explained sum of squares (based on the dataset with 40 students) is 36.83%. The two dimensions for the variables distinguish the TV programs with respect to educational and informational aspects (Component 1) and entertaining aspects (Component 2). By repeating the analysis on the nonmissing data only, we observe that the component scores for the variables differ slightly, but the component interpretation remains the same. The analysis of the nonmissing data leads to one difference in the object partition, with All in the Family assigned to the same cluster as Mash and The Tonight Show.

We analyze the preprocessed TV data by means of CPclus, setting K = 5 for comparative purposes. The fit is equal to 45.62%. This value is not fully comparable with the T3clus one because we consider 30 students; moreover, CPclus here has more parameters than T3clus because it has K = 5 components for each of the three modes. Nevertheless, given the same number of clusters, we can roughly observe a certain increase in fit. We obtain the matrices U and B reported in Tables 8 and 9, respectively, and the following cluster weights: 72.30, 38.46, 39.16, 56.16, and 55.38. Matrix C is not reported because no information on the subjects is available and, therefore, it does not help in interpreting the solution. We only note that all elements but one are positive: the only exception is the score c22, equal to –0.09.
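The fit values reported here and in the remainder of the section are percentages of explained sum of squares; assuming the usual least-squares definition, they can be written as

$$\text{fit}\,(\%) \;=\; 100\left(1-\frac{\lVert \mathbf{X}-\widehat{\mathbf{X}}\rVert^{2}}{\lVert \mathbf{X}\rVert^{2}}\right),$$

where X denotes the preprocessed three-way array, X̂ its model reconstruction, and the norm is obtained by summing all squared entries of the array.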

Table 9 TV data. Component matrix B for the bipolar scales (scores higher than 0.25 in absolute value are in bold, CPclus with K = 5)

Cluster 1, characterized by the highest weight, comprises Charlie’s Angels and Let’s Make a Deal. Bearing in mind the non-negativity of the first column of C, these TV programs are perceived, in decreasing order of importance, as uninformative, intellectually dull, idiotic, uninteresting, fantasy, and shallow. Hence, we can observe an entertainment dimension characterized by relaxing TV programs. The Waltons and Little House on the Prairie belong to cluster 2. By inspecting the second column of B, we can conclude that the students (except student no. 2) point out the soporific features of these TV programs. Kojak, Football, and Saturday Night Live are assigned to cluster 3. Several scales describe these TV programs, which are mainly recognized for their roughness. The latter two clusters are associated with the lowest weights. Cluster 4 identifies TV programs regarding culture and information (60 Minutes, News, Jacques Cousteau, and Wild Kingdom) that are considered, among others, informative and intellectually stimulating. Finally, cluster 5 contains Mash, All in the Family, The Tonight Show, and Mork and Mindy. They are considered satirical and funny, thus highlighting an entertainment dimension different from that of cluster 1.

By comparing the CPclus and T3clus results, we observe that ARI = 0.67. In particular, two clusters (clusters 1 and 4) resulting from the two models coincide, while cluster 5 is the union of two T3clus clusters. Finally, the remaining two clusters split one T3clus cluster into two equally sized parts.

The solution with K = 5 is probably not the optimal one in terms of fit and parsimony, since a lower value of K may provide a more useful solution that still fits the data reasonably well. In order to choose K, we plot the goodness-of-fit values (expressed as percentages) for different numbers of clusters K* (Fig. 4) and compute the SR indices (Table 10).

Fig. 4 TV data. Goodness of fit values of CPclus (expressed as a percentage) for different numbers of clusters K*

Table 10 TV data. SR indices for CPclus solutions with different numbers of clusters K*

By inspecting Fig. 4 and Table 10, we can see that the optimal number of clusters/components seems to be K = 4, which corresponds to the maximum SR index (1.91) and yields a fit (41.36%) only slightly lower than that of the solution with K = 5 (45.62%). We skip the solution with K = 6 (second highest SR index), which appears to be too complex, and we also examine the more parsimonious solution with K = 3 (third highest SR index, 1.33; fit = 33.24%). Thus, we investigate these two solutions in detail. When K = 3, the following clusters are found:

  • Cluster 1 (weight 55.38): 60 Minutes, News, Jacques Cousteau, and Wild Kingdom;

  • Cluster 2 (weight 39.88): Charlie’s Angels, Let’s Make a Deal, The Waltons, Kojak, Football, and Little House on the Prairie;

  • Cluster 3 (weight 36.60): Mash, All in the Family, The Tonight Show, Saturday Night Live, and Mork and Mindy.

Cluster 1 was already discovered in the five-cluster solution and refers to TV programs characterized by culture and information. Cluster 3 resembles cluster 5 of the five-cluster solution, with the addition of Saturday Night Live. Its interpretation can be based on matrix B, reported in Table 11.

Table 11 TV data. Component matrix B for the bipolar scales (scores higher than 0.25 in absolute value are in bold, CPclus with K = 3)

Consistently with our previous findings, cluster 3 can be described by an entertainment dimension related to the scales satirical, funny and, to a lesser extent, erotic. The erotic aspect represents a novelty and appears to be related to Saturday Night Live. Finally, cluster 2, which is the largest cluster, contains leisure TV programs recognized as, among others, uninteresting, intellectually dull, and idiotic.

The four-cluster solution is summarized below:

  • Cluster 1 (weight 56.16): The Waltons and Little House on the Prairie;

  • Cluster 2 (weight 55.38): 60 Minutes, News, Jacques Cousteau, and Wild Kingdom;

  • Cluster 3 (weight 39.16): Mash, All in the Family, The Tonight Show, and Mork and Mindy;

  • Cluster 4 (weight 46.41): Charlie’s Angels, Let’s Make a Deal, Saturday Night Live, Kojak, and Football.

We can see that it is closely related to the three-cluster solution. Table 12 reports the cross-tabulation of the K = 3 and K = 4 solutions.

Table 12 TV data. Cross tabulation of the CPclus solutions with K = 3 and K = 4

The main difference is that The Waltons and Little House on the Prairie form a separate cluster, and Saturday Night Live is assigned to a different cluster. It follows that three out of four clusters (clusters 1–3) are equal to those obtained by setting K = 5. Cluster 4 identifies TV programs suited to taking a break, as shown by the last column of matrix B (see Table 13). These programs are perceived, in decreasing order of importance, as uninformative, intellectually dull, crude, insensitive, idiotic, shallow, callous, fantasy, leave me cold, and uninteresting; in other words, they are everyday-life distractors.

Table 13 TV data. Component matrix B for the bipolar scales (scores higher than 0.25 in absolute value are in bold, CPclus with K = 4)

All in all, we think that the solution with K = 4 clusters/components should be chosen because it is easily interpretable and represents the best compromise between fit and parsimony. The five-cluster solution can be seen as a refinement of the preferred one, but the increasing complexity is not justified by the additional cluster/component. Conversely, the three-cluster solution is too simplistic because TV programs such as The Waltons and Little House on the Prairie are rather different from the others and should reasonably belong to a separate cluster.

For comparative purposes, we also apply CLV3W to the same preprocessed data by setting K = 3, 4, and 5. The fit values are 43.21%, 46.38%, and 47.82%, respectively. The CLV3W solutions are always rather different from the CPclus ones. The ARI values comparing the CLV3W and CPclus solutions with the same number of clusters are 0.19 (K = 3), 0.37 (K = 4), and 0.25 (K = 5).

When K = 3, the CLV3W partition is:

  • Cluster 1: Mash (component score = –0.20), All in the Family (–0.31), 60 Minutes (0.45), The Tonight Show (–0.24), News (0.57), and Mork and Mindy (–0.53);

  • Cluster 2: Charlie’s Angels (0.42), Let’s Make a Deal (0.25), Saturday Night Live (0.48), Kojak (0.24), Jacques Cousteau (–0.55), and Wild Kingdom (–0.41);

  • Cluster 3: The Waltons (0.64), Football (–0.47), and Little House on the Prairie (0.60).

The component scores for the TV programs differ noticeably within clusters, highlighting that focusing on between-cluster differences in covariation rather than in mean leads to different partitions. Taking into account that matrix C has all positive scores, we can interpret the clusters/components as follows (matrix B is not reported). Cluster 1 contains six TV programs, two of them with positive component scores and the remaining four with negative ones. Those with positive scores (the highest pertains to News) are perceived as not funny, non-satirical, real, informative, and intellectually stimulating, while the opposite holds for the TV programs with negative scores (Mork and Mindy has the largest score in absolute value). A duality between TV programs with positive and negative scores also emerges in connection with cluster 2. In this case, positive scores (e.g., Charlie’s Angels) denote TV programs considered satirical, violent, fantasy, and erotic. With the same reasoning, we can interpret cluster 3. For instance, Football is recognized, among others, as insensitive, callous, violent, and crude.

The solution with K = 4 clusters is closely related to the one just described. The main difference is the split of cluster 2 into two clusters. More specifically, Charlie’s Angels (score = 0.87) and Kojak (0.49) are the only two elements of a new cluster (cluster 4). The remaining clusters are essentially unchanged; the only exception is 60 Minutes, now assigned to cluster 2 (with a negative score) together with Let’s Make a Deal, Saturday Night Live, Jacques Cousteau, and Wild Kingdom. The interpretation of the first three clusters/components is very similar to the previous one. In addition, the new cluster concerns TV programs considered violent, erotic, fantasy, and uninformative.

Finally, the K = 5 solution might be seen as a refinement of the previous two solutions. With respect to the K = 4 solution, clusters 3 and 4 remain the same, while some differences in clusters 1 and 2 can be observed. First of all, cluster 5 contains The Tonight Show and Saturday Night Live. Apart from The Tonight Show, cluster 1 coincides with cluster 1 found for K = 3. Cluster 2 is composed of Let’s Make a Deal, Jacques Cousteau, and Wild Kingdom.

We also provide a summary of the CLV3W-NN solutions with K = 3, 4, and 5. First of all, the fit values are 38.07%, 44.30%, and 46.71%, respectively, hence in between the fit values of CPclus and CLV3W for the same choice of K. Setting K = 3, we get the following partition:

  • Cluster 1: 60 Minutes (component score = 0.48), News (0.53), Jacques Cousteau (0.56), Football (0.13), and Wild Kingdom (0.40);

  • Cluster 2: The Waltons (0.73) and Little House on the Prairie (0.68);

  • Cluster 3: Mash (0.19), Charlie’s Angels (0.32), All in the Family (0.29), The Tonight Show (0.25), Let’s Make a Deal (0.24), Saturday Night Live (0.61), Kojak (0.15), and Mork and Mindy (0.52).

The partition differs from both the CPclus and the CLV3W ones with K = 3 (ARI = 0.36 and ARI = 0.12, respectively). Taking into account that the elements of C are non-negative and by inspecting B (not reported), we can conclude that cluster 1 identifies TV programs related to reality, perceived as non-satirical, not funny, informative, not erotic, real, and intellectually stimulating. Consistently with previous results, The Waltons and Little House on the Prairie are assigned to the same cluster (cluster 2). Cluster 3 is the largest, containing TV programs that, among others, are satirical, fantasy, and funny. By comparing the CLV3W and CLV3W-NN solutions, we can see that the former can assign to the same cluster TV programs that students rate in opposite ways, whereas this cannot happen with CLV3W-NN because of the non-negativity constraints.

The solutions with K = 4 and 5 are refinements of the K = 3 solution. Note that, as for K = 3, these solutions provide partitions more similar to the CPclus ones than to the CLV3W ones, given the same K. In fact, when K = 4, the ARI values with CPclus and CLV3W are 0.78 and 0.31, respectively, and when K = 5, we get 0.77 and 0.26, respectively. Setting K = 4, we observe that cluster 3 of the K = 3 solution is split into two clusters. One of them also includes Football, which has the lowest component score in cluster 1 of the K = 3 solution and no longer belongs to that cluster. When K = 5, the partition resembles the one with K = 4 clusters, except for two TV programs, News and Football, which are assigned to the new cluster 5.

5.2 Crab Data

The CPclus model is used to address the classification problem of the Crab data (see, e.g., Kroonenberg, 2008). N = 48 blue crabs lived in two areas of North Carolina, Albemarle Sound and the Pamlico River. All 16 crabs from Albemarle Sound were healthy, while the 32 crabs from the Pamlico River were divided into two equally sized groups (healthy crabs and diseased crabs). Thus, the crabs’ geographical area and health status allow us to distinguish three equally sized clusters. In the analysis, J = 3 tissue samples (gill, hepatopancreas, and muscle) were considered, and data on H = 25 trace elements were collected. The research interest lies in assessing whether the K = 3 known-in-advance clusters can be identified from the observed values of the trace elements for the three tissue samples.

For each combination of tissue sample and trace element, the data are standardized across the crabs before applying CPclus. The quality of the classification obtained by setting K = 3 is reported in Table 14.
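As with the TV data, a minimal sketch of this preprocessing step is given below, assuming an objects × tissue samples × trace elements layout; the only difference from the earlier centering sketch is the division by the per-combination standard deviation.

```python
# Sketch of the Crab-data preprocessing: z-scoring of every combination of
# tissue sample and trace element across the crabs (object mode).
import numpy as np

def standardize_across_objects(X):
    """X: (N x J x H) array; zero mean and unit variance over the object mode."""
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, ddof=1, keepdims=True)
    return (X - mean) / std
```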

Table 14 Crab data. Confusion matrix (CPclus with K = 3, cluster weights in parentheses)

Only three of the 48 crabs are not correctly assigned. Specifically, the geographical area is always identified correctly, but two diseased crabs from the Pamlico River are recognized as healthy. Therefore, we can conclude that the trace elements distinguish the crabs very well. To investigate the peculiarities of the clusters, we report the component matrices B and C in Tables 15 and 16, respectively.

Table 15 Crab data. Component matrix B for the tissue samples (CPclus with K = 3)
Table 16 Crab data. Component matrix C for the trace elements (scores higher than 0.25 in absolute value are in bold, CPclus with K = 3)

From Table 15, we can see that, for every column of B, the highest component score pertains to the gill, which implies that this tissue plays the most relevant role in classifying the crabs. The component scores for the trace elements highlight the key aspects for distinguishing the crabs. To clarify this point, from Table 16 we can see that the diseased crabs are characterized by large values of, among others, aluminum, cobalt, manganese, and iron, i.e., the highest scores in the first column of C. These diseased crabs from the Pamlico River are distinguished from the healthy ones from the same area because the latter (in cluster 3) have high values of nickel, phosphorus, and magnesium and low values of cadmium and vanadium (third column of C). The second column of C allows us to describe the distinctive features of the crabs from Albemarle Sound: high levels of copper and cadmium and low levels of, e.g., magnesium and cobalt.

Given the cluster/component interpretation, we investigate the misclassification of the three crabs. Limiting our attention to the gill tissue, we find that the three crabs involved present different characteristics from the other diseased crabs for the six trace elements that most influence the component. In detail, among the diseased crabs, the values in the left tails of the distributions of aluminum, cobalt, manganese, iron, etc. almost always pertain to the misclassified crabs. For such crabs, it is therefore not clear how the disease is related to the trace elements.

By applying CLV3W and CLV3W-NN with K = 3, we obtain different partitions summarized in the confusion matrices of Tables 17 and 18, respectively.

Table 17 Crab data. Confusion matrix (CLV3W with K = 3)
Table 18 Crab data. Confusion matrix (CLV3W-NN with K = 3)

Obviously, these cluster structures depend on rather different underlying components. To give an example from the CLV3W solution, the scores of the first component for the trace elements that are higher than 0.25 in absolute value are 0.42 (potassium), –0.38 (chromium), and 0.31 (silver). By inspecting Table 16, we can observe that these three trace elements play a negligible role in all three CPclus components, which instead capture between-cluster differences in mean.

Given the size of the array, MVNM and PMVNM cannot be applied. In fact, even when considering the additive model for the mean matrices, the number of objects is smaller than the number of free parameters, since N = 48 < K(J + H – 1) = 3(3 + 25 – 1) = 81. However, by considering Table 16, we empirically select a subset of three trace elements that seem to play a relevant role in classifying the crabs: arsenic, cadmium, and iron, all of which have large scores in absolute value on two components. We model this smaller dataset of order (48 × 3 × 3), for which K(J + H – 1) = 3(3 + 3 – 1) = 15 < 48, by using MVNM and PMVNM with K = 3. The confusion matrices are reported in Tables 19 and 20.

Table 19 Crab data. Confusion matrix (MVNM with K = 3)
Table 20 Crab data. Confusion matrix (PMVNM with K = 3)

We can observe a fairly good recovery performance for both MVNM and PMVNM, bearing in mind that it is based on only three of the 25 trace elements.

6 Concluding Remarks

We have presented the CPclus model for the simultaneous reduction of the objects, variables, and subjects of three-way data. It achieves this goal by seeking both clusters of objects and components for variables and subjects. The three-way structure of the data is properly taken into account by assuming a Candecomp/Parafac configuration for the variable and subject components. The object partition is found by following a K-means strategy, where the Candecomp/Parafac-based model part for variables and subjects plays the same role as the centroids do in K-means. As extensively discussed, unlike the closely related CLV3W and CLV3W-NN models, which focus on between-cluster differences in covariation, CPclus identifies a partition of the objects based on between-cluster differences in mean. This peculiarity is shared with the model-based proposals. However, unlike them, CPclus does not require the number of objects to be much larger than the numbers of variables and subjects and, in general, can be applied to three-way arrays of arbitrary size without the additional assumptions required by the model-based models.

An extensive simulation study has been carried out to assess the ability to recover clusters and components and to investigate computational issues. Moreover, applications to real-life data have been presented to show the potential and effectiveness of CPclus. Comparisons with some of its competitors based on the simulated and real data have also been made. The results are satisfactory and stimulate future research on the topic. The methodology might be extended following the fuzzy approach to clustering, by replacing the hard allocation of objects to clusters with soft membership degrees. Another interesting line of research might be the introduction of weights for variables and/or subjects to enhance clustering accuracy. Furthermore, a hierarchical approach might be adopted for partitioning the objects. Finally, it would be interesting to investigate how CPclus performs when the data are highly skewed or contain outliers, and possibly to propose suitable robust versions.