A parameter-less algorithm for tensor co-clustering

The majority of the data produced by human activities and modern cyber-physical systems involve complex relations among their features. Such relations can often be represented by means of tensors, which can be viewed as generalizations of matrices and, as such, can be analyzed by using higher-order extensions of existing machine learning methods, such as clustering and co-clustering. Tensor co-clustering, in particular, has proven useful in many applications, due to its ability to cope with n-modal data and sparsity. However, setting up a co-clustering algorithm properly requires the specification of the desired number of clusters for each mode as input parameters. This choice is already difficult in relatively easy settings, like flat clustering on data matrices, but on tensors it could be even more frustrating. To face this issue, we propose a new tensor co-clustering algorithm that does not require the number of desired co-clusters as input, as it optimizes an objective function based on a measure of association across discrete random variables (called Goodman and Kruskal's $\tau$) that is not affected by their cardinality. We introduce different optimization schemes and show their theoretical and empirical convergence properties. Additionally, we show the effectiveness of our algorithm on both synthetic and real-world datasets, also in comparison with state-of-the-art co-clustering methods based on tensor factorization and latent block models.


Introduction
The increasing complexity of the data produced by humans and cyber-physical systems requires more sophisticated machine learning algorithms able to handle it and take advantage of the manifold of the variable space. This phenomenon also affects data structures that, on the one hand, should adapt to large datasets and, on the other hand, should be able to represent complex relations among data instances. A clear example of such evolution is certainly constituted by tensors, which have gained much attention in the last twenty years.
Editor: Petra Kralj Novak, Tomislav Šmuc.
Tensors are widely used mathematical objects that well represent complex information such as gene expression data (Zhao and Zaki 2005), social networks (Hong and Jung 2018), heterogeneous information networks (Ermis et al. 2015; Yu et al. 2019), time-evolving data (Araujo et al. 2018), behavioral patterns (He et al. 2018), and multi-lingual text corpora (Papalexakis and Dogruöz 2015). In general, every n-ary relation can be easily represented as a tensor. From the algebraic point of view, in fact, tensors can be seen as multimodal generalizations of matrices and, as such, can be processed with mathematical and computational methods that generalize those usually employed to analyze data matrices, e.g., non-negative factorization (Shashua and Hazan 2005), singular value decomposition (Zhang and Golub 2001), itemset and association rule mining (Cerf et al. 2009; Nguyen et al. 2011; Cerf et al. 2013), clustering and co-clustering (Banerjee et al. 2007; Wu et al. 2016).
Clustering, in particular, is by far one of the most popular unsupervised machine learning techniques since it allows analysts to obtain an overview of the intrinsic similarity structures of the data with relatively little background knowledge about them. However, with the availability of high-dimensional heterogeneous data, co-clustering has gained popularity, since it provides a simultaneous partitioning of each mode (rows and columns of the matrix, in the two-dimensional case). In practice, it copes with the curse of dimensionality problem by performing clustering on the main dimension (data objects or instances) while applying dimensionality reduction on the other dimension (features). Despite its proven usefulness, the correct application of tensor co-clustering is limited by the fact that it requires the specification of a congruent number of clusters for each mode, while, in realistic analysis scenarios, the actual number of clusters is unknown. Furthermore, matrix/tensor (co-)clustering is often based on a preliminary tensor factorization step that, in turn, requires further input parameters (e.g., the number of latent factors within each mode). As a consequence, it is practically impossible to explore all combinations of parameter values in order to identify the best clustering results.
The main reason for this problem is that most clustering algorithms (and tensor factorization approaches) optimize objective functions that strongly depend on the number of clusters (or factors). Hence, two solutions with two different numbers of clusters cannot be compared directly. Although fixing the number of clusters considerably reduces the size of the search space, it prevents the discovery of a better partitioning once a wrong number of clusters is selected. In this paper, which extends our previous work (Battaglia and Pensa 2019), we address this limitation by proposing a new tensor co-clustering algorithm that optimizes a new class of objective functions that can be viewed as n-modal extensions of an association measure called Goodman-Kruskal's $\tau$ (Goodman and Kruskal 1954), whose local optima do not depend on the number of clusters. We model our tensor co-clustering approach as a multi-objective optimization problem and discuss, both theoretically and experimentally, the convergence properties of our extensions and of the related optimization schemes. Additionally, we conduct a thorough experimental validation showing that our algorithms provide accurate clustering results in each mode of the tensor. Compared with state-of-the-art techniques that require the desired number of clusters in each mode as input parameters, our approach achieves similar or better results at the price of a reasonable increase of the running time. Additionally, it is also effective in clustering real-world datasets.
In summary, the main contributions of this paper are as follows: (1) we define a new class of objective functions for n-mode tensor co-clustering, based on Goodman-Kruskal's $\tau$ association measure, which do not require the number of clusters as input parameter; (2) we propose several variants of a multi-objective optimization algorithm, based on stochastic local search, and study their convergence properties showing that they support the rapid convergence towards a local optimum; (3) we show the effectiveness of our method experimentally on both synthetic and real-world data, also in comparison with state-of-the-art competitors. The remainder of the paper is organized as follows: the related works are analyzed in Sect. 2; the generalization of the Goodman-Kruskal's $\tau$ association measure is presented in Sect. 3, while the variants of the optimization algorithms are described in Sect. 4; Sect. 5 provides the report of our experiments; finally, we draw some conclusions in Sect. 6.

Related work
Analyzing multi-way data (or n-way tensors) has attracted a lot of attention due to their intrinsic complexity and richness. Hence, to deal with this complexity, in the last two decades, many ad hoc methods and extensions of 2-way matrix methods have been proposed, many of which are tensor decomposition models and algorithms (Kolda and Bader 2009). As an example, both singular value decomposition (Zhang and Golub 2001) and non-negative matrix factorization (Shashua and Hazan 2005) have been extended to work with high-order tensor data. Furthermore, knowledge discovery and exploratory data mining techniques, including closed itemset mining (Cerf et al. 2009, 2013) and association rule discovery (Nguyen et al. 2011), have been successfully applied to n-way data as well.
The problem of clustering and co-clustering of higher-order data has also been extensively addressed. Co-clustering has been developed as a matrix method and studied in many different application contexts including text mining (Dhillon et al. 2003; Pensa et al. 2014), gene expression analysis (Cho et al. 2004) and graph mining (Chakrabarti et al. 2004) and has been naturally extended to tensors for its ability to handle n-modal high-dimensional data well. Banerjee et al. (2007) perform clustering using a relation graph model that describes all the known relations between the modes of a tensor. Their tensor clustering formulation captures the maximal information in the relation graph by exploiting a family of loss functions known as Bregman divergences. They also present several structurally different multi-way clustering schemes involving a scalable algorithm based on alternate minimization. Instead, Zhou et al. (2009) use tensor-based latent factor analysis to address co-clustering in the context of web usage mining. Their algorithm is executed via the well-known multi-way decomposition algorithm called CANDECOMP/PARAFAC (Harshman 1970). Papalexakis et al. (2013) formulate co-clustering as a constrained multi-linear decomposition with sparse latent factors. They propose a basic multi-way co-clustering algorithm exploiting multi-linearity using Lasso-type coordinate updates. Additionally, they propose a line search optimization approach based on iterative majorization and polynomial fitting. Zhang et al. (2013) propose an extension of the tri-factor non-negative matrix factorization model (Ding et al. 2006) to a tensor decomposition model performing adaptive dimensionality reduction by integrating the subspace identification and the (hard or soft) clustering process into a single process. Their algorithm computes two basis matrices representing the common characteristics of the samples and one 3-D tensor denoting the peculiarities of the samples. The model can be used to perform dimensionality reduction as well. Instead, Wu et al. (2016) introduce a spectral co-clustering method based on a new random walk model for nonnegative square tensors.
Other more recent approaches (Boutalbi et al. 2019a, b) rely on an extension of the latent block model. In these works, co-clustering for sparse tensor data is viewed as a multi-way clustering model where each slice of the third mode of the tensor represents a relation between two sets. Finally, Wang and Zeng (2019) present a co-clustering approach for tensors by using a least-square estimation procedure for identifying n-way block structure that applies to binary, continuous, and hybrid data instances.
Differently from all these approaches, our tensor co-clustering algorithm is not based on any factorization method or block model hypothesis. Instead, it optimizes an extension of a measure of association whose effectiveness has been proven in matrix (2-way) clustering (Huang et al. 2012) and co-clustering (Ienco et al. 2013), and that naturally helps discover the correct number of clusters in tensors with arbitrary shape and density. It is worth noting, in fact, that the co-clustering performances of all the methods mentioned in this section strongly rely on the correct choice of the number of clusters/factors/blocks, which limits their application in realistic data analysis scenarios.

An association measure for tensor co-clustering
In this section, we introduce the objective function optimized by our tensor co-clustering algorithm (presented in the next section). It consists of an association measure, called Goodman and Kruskal's $\tau$ (Goodman and Kruskal 1954), that evaluates the dependence between two discrete variables and has been used to assess the quality of 2-way co-clustering (Robardet and Feschet 2001) with good partitioning results. We generalize its definition to an n-mode tensor setting.

Goodman and Kruskal's $\tau$ and its generalization
Goodman and Kruskal's $\tau$ (Goodman and Kruskal 1954) is an association measure that estimates the strength of the link between two discrete variables $X$ and $Y$ according to the proportional reduction of the error in predicting one of them knowing the other. In more detail, let $x_1, \dots, x_m$ be the values that variable $X$ can assume, with probabilities $p_X(1), \dots, p_X(m)$, and let $y_1, \dots, y_n$ be the possible values $Y$ can assume, with probabilities $p_Y(1), \dots, p_Y(n)$. The error in predicting $X$ can be evaluated as the probability that two different observations from the marginal distribution of $X$ fall in different categories:
$$e_X = \sum_{i=1}^{m} p_X(i)\,\bigl(1 - p_X(i)\bigr) = 1 - \sum_{i=1}^{m} p_X(i)^2.$$
Similarly, the error in predicting $X$ knowing that $Y$ has value $y_j$ is
$$e_{X|Y}(j) = 1 - \sum_{i=1}^{m} p_{X|Y}(i|j)^2$$
and the expected value of the error in predicting $X$ knowing $Y$ is
$$E[e_{X|Y}] = \sum_{j=1}^{n} e_{X|Y}(j)\, p_Y(j) = 1 - \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{p_{X,Y}(i,j)^2}{p_Y(j)}.$$
Then the Goodman and Kruskal $\tau_{X|Y}$ measure of association is defined as
$$\tau_{X|Y} = \frac{e_X - E[e_{X|Y}]}{e_X} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \frac{p_{X,Y}(i,j)^2}{p_Y(j)} - \sum_{i=1}^{m} p_X(i)^2}{1 - \sum_{i=1}^{m} p_X(i)^2}.$$
Conversely, the proportional reduction of the error in predicting $Y$ while $X$ is known is
$$\tau_{Y|X} = \frac{e_Y - E[e_{Y|X}]}{e_Y} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \frac{p_{X,Y}(i,j)^2}{p_X(i)} - \sum_{j=1}^{n} p_Y(j)^2}{1 - \sum_{j=1}^{n} p_Y(j)^2}.$$
In order to use this measure for the evaluation of a tensor co-clustering, we need to extend it so that it can evaluate the association of $n$ distinct discrete variables. Let $X_1, \dots, X_n$ be discrete variables such that $X_i$ can assume $m_i$ distinct values (for simplicity, we will denote the possible values as $1, \dots, m_i$), for $i = 1, \dots, n$. Let $p_{X_i}(k)$ be the probability that $X_i = k$, for $k = 1, \dots, m_i$, for $i = 1, \dots, n$. Reasoning as in the two-dimensional case, we can define the reduction in the error in predicting $X_i$ while $(X_j)_{j \neq i}$ are all known as
$$\tau_{X_i} = \tau_{X_i|(X_j)_{j \neq i}} = \frac{e_{X_i} - E[e_{X_i|(X_j)_{j \neq i}}]}{e_{X_i}} = \frac{\sum_{k_1=1}^{m_1} \cdots \sum_{k_n=1}^{m_n} \frac{p_{X_1 \dots X_n}(k_1, \dots, k_n)^2}{p_{(X_j)_{j \neq i}}((k_j)_{j \neq i})} - \sum_{k_i=1}^{m_i} p_{X_i}(k_i)^2}{1 - \sum_{k_i=1}^{m_i} p_{X_i}(k_i)^2} \tag{1}$$
for all $i \le n$. When $n = 2$, the measure coincides with Goodman-Kruskal's $\tau$. Notice that, in the n-dimensional case as well as in the 2-dimensional case, the error in predicting $X_i$ knowing the value of the other variables is always non-negative and smaller than or equal to the error in predicting $X_i$ without any knowledge about the other variables. It follows that $\tau_{X_i}$ takes values in $[0, 1]$. It will be 0 if knowledge of the other variables is of no help in predicting $X_i$, while it will be 1 if knowledge of the values assumed by variables $(X_j)_{j \neq i}$ completely specifies $X_i$.
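As an illustration of the generalized measure, the following is a minimal sketch that evaluates the reduction of the error in predicting one variable given all the others, starting from a joint distribution stored as a dictionary. The function name `gk_tau`, the dictionary representation, and the convention of returning 0 for a constant variable are assumptions made for this example, not part of the original formulation.

```python
from collections import defaultdict


def gk_tau(joint, i):
    """Proportional reduction of the error in predicting variable i
    when all the other variables are known.

    `joint` maps index tuples (k_1, ..., k_n) to non-negative counts or
    probabilities; it is normalised internally.
    """
    total = sum(joint.values())
    p_i = defaultdict(float)     # marginal distribution of X_i
    p_rest = defaultdict(float)  # joint distribution of the other variables
    for idx, v in joint.items():
        p_i[idx[i]] += v / total
        p_rest[idx[:i] + idx[i + 1:]] += v / total
    e_marginal = 1.0 - sum(p * p for p in p_i.values())
    if e_marginal <= 0.0:
        return 0.0  # assumed convention: a constant X_i has nothing to predict
    e_conditional = 1.0 - sum(
        (v / total) ** 2 / p_rest[idx[:i] + idx[i + 1:]]
        for idx, v in joint.items() if v > 0
    )
    return (e_marginal - e_conditional) / e_marginal


# Two binary variables: perfectly associated vs. independent.
diagonal = {(0, 0): 5, (1, 1): 5}
uniform = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1}
print(gk_tau(diagonal, 0), gk_tau(uniform, 0))
```

With two variables the function reduces to the classical 2-way measure; the two toy tables correspond to the extreme cases of complete and absent association discussed above.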

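Tensor co-clustering with Goodman-Kruskal's $\tau$
Let $\mathcal{X} \in \mathbb{R}_+^{m_1 \times \dots \times m_n}$ be a tensor with $n$ modes and non-negative values. Let us denote with $x_{k_1 \dots k_n}$ the generic element of $\mathcal{X}$, where $k_i = 1, \dots, m_i$ for each mode $i = 1, \dots, n$. A co-clustering $\mathcal{P}$ of $\mathcal{X}$ is a collection of $n$ partitions $\{\mathcal{P}_i\}_{i=1,\dots,n}$, where $\mathcal{P}_i = \{C^i_1, \dots, C^i_{c_i}\}$ is a partition of the elements on the $i$-th mode of $\mathcal{X}$ in $c_i$ groups, with $c_i \le m_i$ for each $i = 1, \dots, n$. Each co-clustering $\mathcal{P}$ can be associated to a tensor $T^{\mathcal{P}} \in \mathbb{R}_+^{c_1 \times \dots \times c_n}$, whose generic element is
$$t_{k_1 \dots k_n} = \sum_{k'_1 \in C^1_{k_1}} \cdots \sum_{k'_n \in C^n_{k_n}} x_{k'_1 \dots k'_n}. \tag{2}$$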

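Consider now $n$ discrete variables $X_1, \dots, X_n$, where each $X_i$ takes values in $\{C^i_1, \dots, C^i_{c_i}\}$. We can look at $T^{\mathcal{P}}$ as the n-modal contingency table that empirically estimates the joint distribution of $X_1, \dots, X_n$: the entry $t_{k_1 \dots k_n}$ represents the absolute frequency of the event $\{X_1 = C^1_{k_1}\} \cap \dots \cap \{X_n = C^n_{k_n}\}$, and the frequency of $X_i = C^i_k$ is the marginal frequency obtained by summing all entries $t_{k_1 \dots k_{i-1} k k_{i+1} \dots k_n}$, with $k_1, \dots, k_{i-1}, k_{i+1}, \dots, k_n$ varying through all possible values and the $i$-th index $k_i$ fixed to $k$. In the same way, we can compute the frequency of the event $\{X_i = C^i_k\} \cap \{X_j = C^j_h\}$ as the sum of all elements $t_{k_1 \dots k_n}$ of $T^{\mathcal{P}}$ having $k_i = k$ and $k_j = h$. More generally, we can compute the marginal joint distribution of $d < n$ variables as the sum of all the entries of $T^{\mathcal{P}}$ having the indices corresponding to the $d$ variables fixed to the values we are considering. For instance, given $T^{\mathcal{P}} \in \mathbb{R}_+^{4 \times 3 \times 5 \times 2}$, the absolute frequency of the event $\{X_1 = C^1_3\} \cap \{X_3 = C^3_4\}$ is
$$t^{(1,3)}_{(3,4)} = \sum_{k_2=1}^{3} \sum_{k_4=1}^{2} t_{3 k_2 4 k_4}.$$
From now on, we will use the newly introduced notation $t^{(\mathbf{v})}_{(\mathbf{w})}$ to denote the sum of all elements of a tensor having the modes in the upper vector (in the example $(1, 3)$) fixed to the values of the lower vector (in the example $(3, 4)$). A formal definition of the scalar $t^{(\mathbf{v})}_{(\mathbf{w})}$ is somewhat clunky: given a tensor $T \in \mathbb{R}_+^{m_1 \times \dots \times m_n}$ and two vectors $\mathbf{v} = (v_1, \dots, v_d)$ and $\mathbf{w} = (w_1, \dots, w_d)$ of dimension $d \le n$, with $v_1 < \dots < v_d \le n$ and $w_j \le m_{v_j}$ for each $j = 1, \dots, d$, we will use the following notation
$$t^{(\mathbf{v})}_{(\mathbf{w})} = \sum_{k_{\bar{v}_1}=1}^{m_{\bar{v}_1}} \cdots \sum_{k_{\bar{v}_r}=1}^{m_{\bar{v}_r}} t_{e_1 \dots e_n}$$
where $\bar{\mathbf{v}}$ is the vector of dimension $r = n - d$ containing all the integers $i \le n$ that are not in $\mathbf{v}$, and $e_i = w_j$ if $i = v_j$ for some $j$, while $e_i = k_i$ otherwise. Summarizing, given a tensor $\mathcal{X}$ with $n$ modes and a co-clustering $\mathcal{P}$ over $\mathcal{X}$, we obtain a tensor $T^{\mathcal{P}}$ that represents the empirical frequency of $n$ discrete variables $X_1, \dots, X_n$, each of them with $c_i$ possible values (where $c_i$ is the number of clusters in the partition on the $i$-th mode of $\mathcal{X}$). Therefore, we can derive from $T^{\mathcal{P}}$ the probability distributions of variables $X_1, \dots, X_n$ and substitute them in Eq. 1: in this way we associate to each co-clustering $\mathcal{P}$ over $\mathcal{X}$ a vector $\tau^{\mathcal{P}} = (\tau^{\mathcal{P}}_{X_1}, \dots, \tau^{\mathcal{P}}_{X_n})$ that can be used to evaluate the quality of the co-clustering. In particular, for any $i \le n$ and any $k_j = 1, \dots, c_j$ ($j = 1, \dots, n$):
$$p_{X_1 \dots X_n}(k_1, \dots, k_n) = \frac{t_{k_1 \dots k_n}}{T}, \qquad p_{X_i}(k_i) = \frac{t^{(i)}_{(k_i)}}{T}, \qquad p_{(X_j)_{j \neq i}}((k_j)_{j \neq i}) = \frac{t^{(j)_{j \neq i}}_{(k_j)_{j \neq i}}}{T}$$
where $T$ is the sum of all entries of $T^{\mathcal{P}}$. It follows that
$$\tau^{\mathcal{P}}_{X_i} = \frac{\sum_{k_1=1}^{c_1} \cdots \sum_{k_n=1}^{c_n} \frac{t_{k_1 \dots k_n}^2}{T \cdot t^{(j)_{j \neq i}}_{(k_j)_{j \neq i}}} - \sum_{k_i=1}^{c_i} \frac{\bigl(t^{(i)}_{(k_i)}\bigr)^2}{T^2}}{1 - \sum_{k_i=1}^{c_i} \frac{\bigl(t^{(i)}_{(k_i)}\bigr)^2}{T^2}} \tag{3}$$
for each $i = 1, \dots, n$. The overall co-clustering schema is depicted in Fig. 1.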
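The construction of the contingency tensor and the marginal-sum notation can be made concrete with a short sketch. The sparse dictionary representation, the names `contingency` and `marginal`, and the 0-based indexing are assumptions of this example; the text counts modes and values from 1.

```python
from collections import defaultdict
from itertools import product


def contingency(X, labels):
    """Build the contingency tensor of a co-clustering by summing the
    entries of X that fall in the same co-cluster.

    X: dict mapping 0-based index tuples to non-negative values.
    labels[i][k]: cluster index of element k on mode i.
    """
    T = defaultdict(float)
    for idx, v in X.items():
        T[tuple(labels[i][k] for i, k in enumerate(idx))] += v
    return dict(T)


def marginal(T, modes, values):
    """Sum of the entries of T whose modes `modes` are fixed to `values`
    (both 0-based here, whereas the text counts from 1)."""
    return sum(v for idx, v in T.items()
               if all(idx[m] == w for m, w in zip(modes, values)))


# A 4 x 4 block-diagonal matrix and the co-clustering that recovers its blocks.
X = {(r, c): 1.0 for r in range(4) for c in range(4) if (r < 2) == (c < 2)}
T = contingency(X, [[0, 0, 1, 1], [0, 0, 1, 1]])

# The 4 x 3 x 5 x 2 example: modes (1, 3) fixed to values (3, 4) in 1-based terms.
ones = {idx: 1.0 for idx in product(range(4), range(3), range(5), range(2))}
t_13_34 = marginal(ones, (0, 2), (2, 3))
```

In the all-ones tensor, fixing two modes leaves the $3 \times 2$ free indices of the remaining modes, so the marginal simply counts those entries.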
It is worth pointing out that the procedure just described makes sense when the tensor $\mathcal{X}$ itself can be interpreted as a contingency tensor; the main assumption of our method is that the quantity $t_{k_1 \dots k_n} / T$, where $t_{k_1 \dots k_n}$ is the entry of $T^{\mathcal{P}}$ given by the sum of all the entries of $\mathcal{X}$ belonging to the same co-cluster, should be interpreted as a probability. This has to be true for each possible co-clustering $\mathcal{P}$ of $\mathcal{X}$, even for the discrete co-clustering (the co-clustering containing only singletons), whose contingency tensor $T^{\mathcal{P}}$ is $\mathcal{X}$. Typical tensors of this kind are those in which the $n$ modes represent different variables, each element on a mode is a possible scenario (or value that the variable can assume), and each entry of the tensor is the count of the occurrences of the intersection of $n$ scenarios. For instance, a words-documents matrix or an authors-words-conferences tensor are suitable choices. However, tensors of non-negative real numbers, in which all the entries represent homogeneous measurements of the same quantity under different scenarios, can also fit. Suppose now we have two different partitions $\mathcal{P}$ and $\mathcal{Q}$ on the same tensor $\mathcal{X}$, corresponding to two different vectors $\tau^{\mathcal{P}}, \tau^{\mathcal{Q}} \in [0, 1]^n$. There is no obvious order relation in $[0, 1]^n$, so it is not immediately clear which one between $\mathcal{P}$ and $\mathcal{Q}$ is "better" than the other.
In Ienco et al. (2013), the authors, in order to compare partitions, adopt a dominance-based approach that induces a partial order over $\mathbb{R}^n$. They introduce the notion of Pareto-dominance for partitions and state that an optimal solution for the co-clustering problem is one that is not dominated by any other solution. We formally define these concepts, in our tensor co-clustering framework, below.
Definition 1 (Pareto dominance) Let $\mathcal{X}$ be an n-modal tensor and let $\mathcal{P}$ and $\mathcal{Q}$ be two partitions on $\mathcal{X}$. We say that partition $\mathcal{P}$ dominates partition $\mathcal{Q}$, in symbols $\mathcal{P} \succ \mathcal{Q}$, if $\tau^{\mathcal{P}}_{X_i} \ge \tau^{\mathcal{Q}}_{X_i}$ for all $i = 1, \dots, n$ and $\tau^{\mathcal{P}}_{X_j} > \tau^{\mathcal{Q}}_{X_j}$ for at least one $j$.
The Pareto dominance relation induces a partial order relation over the set $\mathbb{P}(\mathcal{X})$ of all partitions on $\mathcal{X}$. It means that, given two partitions $\mathcal{P}$ and $\mathcal{Q}$, we can always say whether $\mathcal{P}$ dominates $\mathcal{Q}$ or not, but it is possible that $\mathcal{P} \nsucc \mathcal{Q}$ and $\mathcal{Q} \nsucc \mathcal{P}$. As a consequence, it is not guaranteed that a unique maximum (with respect to relation $\succeq$) exists in $\mathbb{P}(\mathcal{X})$.
Definition 2 (Pareto optimal partition) We say that a partition $\mathcal{P}$ on tensor $\mathcal{X}$ is a Pareto-optimal partition if $\mathcal{P}$ is not dominated by any other partition. In symbols, $\mathcal{P}$ is an optimal partition if $\mathcal{P} \nprec \mathcal{Q}$ for any $\mathcal{Q} \in \mathbb{P}(\mathcal{X})$.
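The two definitions can be illustrated on plain vectors of $\tau$ values. The sketch below assumes the usual reading of Pareto dominance (no component worse and at least one strictly better); the function names and the example vectors are hypothetical.

```python
def dominates(tau_p, tau_q):
    """True if the tau vector of P Pareto-dominates that of Q."""
    pairs = list(zip(tau_p, tau_q))
    return all(p >= q for p, q in pairs) and any(p > q for p, q in pairs)


def pareto_optimal(candidates):
    """Keep the tau vectors that no other candidate dominates."""
    return [t for t in candidates
            if not any(dominates(other, t) for other in candidates)]


taus = [(0.7, 0.2), (0.3, 0.6), (0.3, 0.5)]
front = pareto_optimal(taus)
```

Here the first two vectors are incomparable, which is exactly the situation in which no unique maximum exists, while the third is dominated by the second.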

A stochastic local search approach to tensor co-clustering
Our co-clustering approach can be formulated as a multi-objective optimization problem: given a tensor $\mathcal{X}$ with $n$ modes and dimension $m_i$ on mode $i$, an optimal co-clustering $\mathcal{P}$ for $\mathcal{X}$ is one that is not dominated by any other co-clustering $\mathcal{Q}$ for $\mathcal{X}$. Since we do not fix the number of clusters, the space of possible solutions is huge (for example, given a very small tensor of dimension $10 \times 10 \times 10$, the number of possible partitions is $1.56 \times 10^{15}$): it is clear that a systematic exploration of all possible solutions is not feasible for a generic tensor $\mathcal{X}$. For this reason we need to find a heuristic that allows us to reach a "good" partition of $\mathcal{X}$, i.e. a partition $\mathcal{P}$ with high values of $\tau^{\mathcal{P}}_{X_k}$ for all modes $k$. With this aim, we propose a stochastic local search approach to solve the maximization problem.
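The size of the search space quoted above can be checked by counting set partitions. The sketch below assumes that the figure is the product of the Bell numbers of the three modes; the helper `bell` is an illustrative name.

```python
def bell(m):
    """Number of partitions of a set of m elements, via the Bell triangle."""
    row = [1]
    for _ in range(m - 1):
        nxt = [row[-1]]          # each row starts with the last entry of the previous one
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]


# Each mode of a 10 x 10 x 10 tensor can be partitioned independently.
n_coclusterings = bell(10) ** 3
print(f"{n_coclusterings:.2e}")
```

Since the count grows super-exponentially with the size of each mode, exhaustive enumeration is already out of reach for tensors of a few dozen elements per mode.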

Tensor co-clustering algorithm
Algorithm 1: $\tau$TCC($\mathcal{X}$, $N_{iter}$)
Input: A tensor $\mathcal{X}$ with $n$ modes and shape $m_1 \times \dots \times m_n$, the maximum number of iterations $N_{iter}$
Result: $\mathcal{P}_1, \dots, \mathcal{P}_n$
Initialize $\mathcal{P}_1, \dots, \mathcal{P}_n$ with discrete partitions;
[…]
$x \leftarrow next(x, k)$ // Select the element following the one selected at iteration $i - 1$ on mode $k$;
[…]
Compute contingency tensor $T_j$ associated to partition $\mathcal{Q}_j$;
Compute $\tau^{\mathcal{Q}_j}$ using Equation (3);
[…]

Algorithm 1 provides the general sketch of our tensor co-clustering algorithm, called $\tau$TCC. It repeatedly considers one mode at a time, sequentially, and tries to improve the quality of the co-clustering by moving one single element from its original cluster to another cluster on the same mode. We will present in the following paragraphs different ways to measure the improvement in the quality of the partition at each iteration (function SelectBestPartition in Algorithm 1), but all the different approaches we will consider can be plugged into the general framework described in Algorithm 1 and explained below.
The partitions on each mode are initialized with the discrete partitions (each element stays in a cluster on its own). At each iteration $i$, for the fixed $k$-th mode, the algorithm randomly selects one cluster $C^k_b$ and one element $x \in C^k_b$. Then it tries to move $x$ into every other cluster $C^k_e$ and into the empty cluster $C^k_e = \emptyset$: among them, it selects the one that most improves the quality of the partition, according to the criterion chosen to measure it (see Sect. 4.2). Of course, if no move increases the quality of the partition, the selected object is left in the original cluster $C^k_b$. When all the $n$ modes have been considered, the $i$-th iteration of the algorithm is concluded. These operations are repeated until a stopping condition is met; we decide to stop the algorithm when no further moves are possible. Because of the stochasticity in the choice of the element to move at each iteration, we cannot be sure that all moves have been tried even if the algorithm has been stuck in the same solution for several iterations. For this reason, when the number of iterations without moves exceeds a given threshold (we set this threshold equal to the dimensionality of the largest mode), we change the object selection strategy and we select, sequentially, all the objects on all the modes. If all objects have been tried but no move is possible, the algorithm ends. Nonetheless, we also include a parameter $N_{iter}$ to control the maximum number of iterations.
At the end of each iteration, one of the following possible moves has been done on mode $k$:
- an object $x$ has been moved from cluster $C^k_b$ to a pre-existing cluster $C^k_e$: in this case the final number of clusters on mode $k$ remains the same (let us call it $c_k$) if $C^k_b$ is non-empty after the move. If $C^k_b$ is empty after the move, it will be deleted and the final number of clusters will be $c_k - 1$;
- an object $x$ has been moved from cluster $C^k_b$ to a new cluster $C^k_e = \emptyset$: the final number of clusters on mode $k$ will be $c_k + 1$ (the useless case when $x$ is moved from $C^k_b = \{x\}$ to $C^k_e = \emptyset$ is not considered);
- no move has been performed and the number of clusters remains $c_k$.
Thus, during the iterative process, the updating procedure is able to increase or decrease the number of clusters at any time. This is due to the fact that, contrary to other measures, such as the loss in mutual information (Dhillon et al. 2003), the $\tau$ measure has an upper limit which does not depend on the number of co-clusters and thus enables the comparison of co-clustering solutions of different cardinalities.
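The general framework can be sketched in a few lines of Python. This is a simplified illustration rather than the algorithm as specified: it draws the element to move uniformly at random (instead of drawing a cluster first), runs for a fixed number of iterations without the sequential sweep used as stopping condition, and stores the tensor as a dictionary of non-zero entries. The names `tau_vector`, `select_alt` and `tcc` are hypothetical; `select_alt` is a plain single-mode criterion without tie-breaking, used only as a default.

```python
import random
from collections import defaultdict


def tau_vector(X, labels):
    """(tau_{X_1}, ..., tau_{X_n}) of the co-clustering encoded by `labels`,
    computed from the contingency tensor as in Equation (3)."""
    T = defaultdict(float)  # sparse contingency tensor
    for idx, v in X.items():
        T[tuple(labels[m][k] for m, k in enumerate(idx))] += v
    total = sum(T.values())
    taus = []
    for i in range(len(labels)):
        t_i, t_rest = defaultdict(float), defaultdict(float)
        for c, v in T.items():
            t_i[c[i]] += v
            t_rest[c[:i] + c[i + 1:]] += v
        sq = sum(v * v for v in t_i.values()) / total ** 2
        joint = sum(v * v / (total * t_rest[c[:i] + c[i + 1:]])
                    for c, v in T.items() if v > 0)
        # Assumed convention: tau is 0 when mode i has a single cluster.
        taus.append(0.0 if sq >= 1.0 else (joint - sq) / (1.0 - sq))
    return taus


def select_alt(k, b, taus):
    """Pick the candidate with the highest tau on mode k; move only if it improves."""
    e = max(range(len(taus)), key=lambda j: taus[j][k])
    return e if taus[e][k] > taus[b][k] else b


def tcc(X, shape, n_iter, select=select_alt, seed=0):
    """Simplified local search: one random element per mode per iteration."""
    rng = random.Random(seed)
    labels = [list(range(m)) for m in shape]  # discrete partitions
    for _ in range(n_iter):
        for k, m in enumerate(shape):
            x = rng.randrange(m)
            current = labels[k][x]
            targets = sorted(set(labels[k]) - {current})
            if labels[k].count(current) > 1:
                targets.append(max(labels[k]) + 1)  # a new, empty cluster
            candidates = [current] + targets  # index 0 is "no move"
            taus = []
            for e in candidates:
                labels[k][x] = e
                taus.append(tau_vector(X, labels))
            labels[k][x] = candidates[select(k, 0, taus)]
    return labels
```

Cluster labels are allowed to become non-contiguous, so a cluster disappears simply when no element carries its label any more, and the number of clusters on each mode can grow or shrink during the search as described above. Any selection criterion with the same `(k, b, taus)` signature can be passed through the `select` argument.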

Neighboring partition selection criteria
As seen above, our co-clustering framework tries to move one element in one fixed mode from its original cluster to another cluster which maximizes the quality improvement of the tensor partition. Since we need to optimize the set $\{\tau^{\mathcal{P}}_{X_k}\}_{k=1}^{n}$ of $n$ objective functions (one for each mode of the tensor), we can define different ways to measure this increase, corresponding to different ways to implement function SelectBestPartition in Algorithm 1. Suppose the algorithm is performing step $i$: during this step, it considers the $k$-th mode of the tensor and selects an object $x$ in cluster $C^k_b$. Function SelectBestPartition takes a set of candidate co-clusterings and their respective values of $\tau$ as input, and has to decide which of them is the best one. In the following, we provide the details of different selection strategies.

Alternating optimization of $\tau_{X_k}$
Since all the candidate co-clusterings differ only in the partition on the $k$-th mode, we can look at the partitions on the $k$-th mode only and select the one with the highest value of $\tau_{X_k}$. In case of ties, the partition with the highest average $\tau$ is selected. The move is made only if $\tau^{\mathcal{Q}_e}_{X_k} > \tau^{\mathcal{Q}_b}_{X_k}$, where $\mathcal{Q}_e$ and $\mathcal{Q}_b$ are the co-clusterings having $x \in C^k_e$ and $x \in C^k_b$ respectively (in the $k$-th partition), while the partitions on all the other modes of the tensor are the same. We call this strategy SelectBestPartition$_{ALT}$ (see Algorithm 2).
[Algorithm 2: SelectBestPartition$_{ALT}$. Input: The mode $k$ of the tensor, the original cluster $b$ of the selected object, …]
The idea behind this selection strategy is that the alternating optimization of the single components $\tau_{X_k}$ should lead to a final vector $(\tau_{X_1}, \dots, \tau_{X_n})$ with high values in each component.
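The alternating selection rule can be written compactly. The following is only a sketch of the rule as described in the text, not the authors' pseudocode: candidates are given as a list of $\tau$ vectors, `b` is the position of the current co-clustering in that list, and the function name is an assumption.

```python
def select_best_partition_alt(k, b, taus):
    """Index of the candidate with the highest tau on mode k.

    Ties are broken by the highest average tau; the move is accepted
    only if tau on mode k strictly improves over the current candidate b.
    """
    def key(j):
        return (taus[j][k], sum(taus[j]) / len(taus[j]))

    e = max(range(len(taus)), key=key)
    return e if taus[e][k] > taus[b][k] else b


candidates = [(0.25, 0.5), (0.5, 0.25), (0.5, 0.75)]
chosen = select_best_partition_alt(0, 0, candidates)
```

In the example the last two candidates tie on mode 0, so the one with the higher average is preferred; the values on the other modes play no other role in the decision.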
However, this optimization strategy has a drawback: since the choice of the best move on mode k is done by looking only at the partition on the k-th mode, it is possible that, after the move, the overall quality of the co-clustering decreases. In Fig. 2 we propose a toy example to better explain this concept. Suppose we are applying our algorithm to a 2-way tensor (a matrix), having on the X mode all the clients of a shop and on the Y mode all the products sold. Each entry of the matrix represents the quantity of each product bought by each customer.
There are three well separated co-clusters in $\mathcal{X}$: the first co-cluster consists of customers who buy products 1, 2 and 3, the second co-cluster represents customers who buy products 4 and 5, and the last co-cluster includes customers who buy product 6.
After some iterations, the algorithm finds five clusters on the X axis and three on the Y axis, with the contingency matrix $T$ of Fig. 2a. Then it selects the last row and tries to move it. There are five possible moves, as shown in Fig. 2b. $\tau_X$ has the highest value for $e = 0$ and, according to Algorithm 2, the last row goes in the first cluster, even if it is clear that the row is 'more similar' to those in clusters 3 and 4. Furthermore, after this move the algorithm will necessarily end with the partition having contingency table $T_{final}$ in Fig. 2c, while it is evident that a "more desirable" co-clustering of $\mathcal{X}$ is $T_{correct}$ in Fig. 2c. This intuitive assessment is also confirmed by the fact that the average $\tau$ measured on $T_{correct}$ (0.771) is higher than the one measured on $T_{final}$ (0.713).
The reason for this behavior is that the algorithm decides where to move the selected row by looking only at the value of $\tau_X$. A more suitable choice would have been to move the last row into cluster 2 or 3, but this means that the algorithm has to look at $\tau_Y$ as well. Furthermore, we need a way to decide which combination of $\tau_X$ and $\tau_Y$ is preferable. In the following subsections we will present some alternative optimization methods, with the aim of mitigating the issue illustrated above.

Optimization of avg($\tau$)
A way to compare real-valued vectors is to use a scalarization function $\mathbb{R}^n \to \mathbb{R}$ and to exploit the natural order in $\mathbb{R}$. Here, we use a function that maps each vector $\tau = (\tau_{X_1}, \dots, \tau_{X_n})$ to the average of its components, $avg(\tau) = \frac{1}{n} \sum_{i=1}^{n} \tau_{X_i}$. As a consequence, $\mathbb{P}(\mathcal{X})$ inherits the total-order structure of $(\mathbb{R}, \le)$ and it is always possible to decide which partition, among a finite set, is the best one.
The above consideration gives us a criterion to select a partition among the set of candidates proposed at each step of the algorithm: the best co-clustering $\mathcal{Q}$ is the one with the highest $avg(\tau^{\mathcal{Q}})$. This means that the selected element $x$ on mode $k$ is moved from its original cluster $C^k_b$ to the cluster $C^k_e$ which maximizes $avg(\tau)$. If there are several clusters $C^k_{e_1}, \dots, C^k_{e_r}$ which maximize $avg(\tau)$, the arrival cluster is randomly selected among them. The move will be executed only if $avg(\tau^{\mathcal{Q}_e}) > avg(\tau^{\mathcal{Q}_b})$. We call this strategy SelectBestPartition$_{AVG}$ (see Algorithm 3).
[Fig. 2: A 2-way tensor to be partitioned and the related contingency matrix obtained by Algorithm 2 after some iterations (a). Rows 1, 2, and 3 are in the first cluster; rows 4 and 5 in the second cluster; all the other rows form singleton clusters. Columns 1, 2, and 3 are in the first cluster; columns 4 and 5 are in the second cluster; column 6 forms a singleton cluster. In (b), the table reports the values of $\tau$ when moving the last row of $\mathcal{X}$ in any row cluster $C_e$ of $T$. The final contingency tables are shown in (c): $T_{final}$ is the contingency matrix obtained with Algorithm 2, $T_{correct}$ is a more desirable final result.]
[Algorithm 3: SelectBestPartition$_{AVG}$. Input: The mode $k$ of the tensor, the original cluster $b$ of the selected object, …]
This strategy has many theoretical advantages over the previous one: it works with a unique objective function and each solution is necessarily better than the previous solutions. Nevertheless, there is a disadvantage with this approach: by looking only at the partitions that increase the objective function $avg(\tau)$ we are reducing the search space. Therefore, there is a greater risk of getting stuck in a poor-quality local optimum. In fact, if there is no move that improves $avg(\tau)$, the algorithm ends with a sub-optimal partition $\mathcal{P}$, while with the alternating optimization strategy we would have been able to move from $\mathcal{P}$ and continue with the optimization, potentially reaching a final result with greater $avg(\tau)$. Furthermore, as we will show experimentally in Sect. 5, when an object on mode $k$ is moved, usually the increase of $\tau_{X_k}$ is compensated by a decrease of (some of) the other $\tau_{X_j}$, for $j \neq k$: this could be a serious issue when the number of modes $n$ is high, because the decrease of $\sum_{j \neq k} \tau_{X_j}$ is often greater than the increase of the single $\tau_{X_k}$, and the algorithm remains stuck in the initial discrete solution. Finally, this method is computationally more expensive than the previous one, because it requires the computation of all $\tau_{X_j}$, while the alternating optimization strategy requires the computation of $\tau_{X_k}$ only.
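The average-based criterion can be sketched as follows, under the same assumptions as before: candidates are a list of $\tau$ vectors, `b` is the position of the current co-clustering, and the function name and the optional `rng` argument are illustrative choices. The mode `k` is kept in the signature only for uniformity with the other criteria.

```python
import random


def select_best_partition_avg(k, b, taus, rng=random):
    """Index of a candidate maximising avg(tau), chosen at random among ties.

    The move is accepted only if the best average strictly exceeds the
    average of the current candidate b; k is unused by this criterion.
    """
    avgs = [sum(t) / len(t) for t in taus]
    top = max(avgs)
    if not top > avgs[b]:
        return b
    return rng.choice([j for j, a in enumerate(avgs) if a == top])


candidates = [(0.25, 0.25), (0.75, 0.0), (0.0, 0.75)]
chosen = select_best_partition_avg(0, 0, candidates, rng=random.Random(0))
```

Because a move is accepted only when the average strictly increases, a run driven by this criterion never revisits a worse solution, which is also why it can stop early in a poor local optimum.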

Aggregate optimization of $\tau$
Algorithm 2 maximizes only $\tau_{X_k}$ when moving an object on mode $k$, ignoring all other $\tau_{X_i}$ ($i \neq k$). Instead, Algorithm 3 looks at the whole vector $\tau^{\mathcal{P}}$ and chooses the partition which maximizes $avg(\tau)$. Here we propose an alternative method that lies in between: it is an alternating maximization of the single $\tau_{X_k}$, according to the mode $k$ considered at the moment, but it adds a term $\tau_{(X_j)_{j \neq k}|X_k}$ to the objective function. This addend takes into account the aggregate modification of the other components of $\tau$ induced by the move on the $k$-th mode. In more detail, in Sect. 3.1 we have generalized the Goodman and Kruskal's $\tau$ measure to $n$ modes as the reduction of the error in predicting one variable when all the other variables are known; we can also define another generalization, i.e. the reduction of the error in predicting the joint value of all the other variables when $X_k$ is known. Reasoning as in Sect. 3.1, we have that
$$\tau_{(X_j)_{j \neq k}|X_k} = \frac{e_{(X_j)_{j \neq k}} - E[e_{(X_j)_{j \neq k}|X_k}]}{e_{(X_j)_{j \neq k}}} = \frac{\sum_{h_1, \dots, h_n} \frac{p_{X_1 \dots X_n}(h_1, \dots, h_n)^2}{p_{X_k}(h_k)} - \sum_{(h_j)_{j \neq k}} p_{(X_j)_{j \neq k}}((h_j)_{j \neq k})^2}{1 - \sum_{(h_j)_{j \neq k}} p_{(X_j)_{j \neq k}}((h_j)_{j \neq k})^2}.$$
The best partition among those considered by this strategy is the one with the highest value of $\tau_{X_k|(X_j)_{j \neq k}} + \tau_{(X_j)_{j \neq k}|X_k}$ (in case of ties we look at the best $\tau_{X_k|(X_j)_{j \neq k}}$). In this way, we require that the best partition among the neighboring ones is one that increases $\tau_{X_k}$ with a decay of the quality of the partitions on the other modes that, overall, is less important than the improvement on mode $k$. We call this aggregate-based strategy SelectBestPartition$_{AGG}$ (see Algorithm 4).
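The additional term used by the aggregate strategy, i.e. the reduction of the error in predicting the joint value of all the other variables once $X_k$ is known, can be computed from a contingency tensor in the same way as the forward measure. The sketch below assumes a sparse dictionary representation with 0-based modes; the function name and the convention for the degenerate case are assumptions.

```python
from collections import defaultdict


def tau_rest_given(T, k):
    """Reduction of the error in predicting the joint value of all the
    other variables when X_k is known, from a sparse contingency tensor T."""
    total = sum(T.values())
    t_k, t_rest = defaultdict(float), defaultdict(float)
    for idx, v in T.items():
        t_k[idx[k]] += v
        t_rest[idx[:k] + idx[k + 1:]] += v
    sq = sum(v * v for v in t_rest.values()) / total ** 2
    if sq >= 1.0:
        return 0.0  # assumed convention: the other variables are constant
    joint = sum(v * v / (total * t_k[idx[k]]) for idx, v in T.items() if v > 0)
    return (joint - sq) / (1.0 - sq)


# Three binary variables where the first one is the parity of the other two.
T = {(0, 0, 0): 1.0, (0, 1, 1): 1.0, (1, 0, 1): 1.0, (1, 1, 0): 1.0}
reverse = tau_rest_given(T, 0)
```

In this toy tensor the other two variables determine the first one completely, whereas knowing the first one only halves the set of possible joint values of the others; the two directions of the measure therefore differ, which is why the aggregate objective sums both.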

Alternative alternating optimization of X k
All three methods proposed above perform a move only when the respective objective function (τ_{X_k} in Algorithm 2, avg(τ) in Algorithm 3, or τ_{X_k|(X_j)_{j≠k}} + τ_{(X_j)_{j≠k}|X_k} in Algorithm 4) increases its value. If there is no move able to increase the value of the objective function, no move is done. Here we propose a slightly different strategy for the alternating optimization of τ. Suppose we are considering partition P and we want to move an object on mode k: we consider only those moves that improve (or at least do not worsen) τ_{X_k} and, among them, we choose the one with the greatest value of avg(τ). Notice that we do not require avg(τ) to increase with respect to partition P: we perform the move if there is any improvement (even a small one) of τ_{X_k} (as in Algorithm 2), and we choose the cluster with the highest avg(τ). Ties are solved in favor of the partition with the highest τ_{X_k}. This method is called SelectBestPartition ALT2, and is sketched in Algorithm 5. As we will show in Sect. 5, this method usually achieves better results than the others and still exhibits a good convergence behavior.

Algorithm 5: SelectBestPartition ALT2 (k, b, (τ^{Q_j})_{j=1,…,c})
Input: The mode k of the tensor, the initial cluster b of the selected object, (τ^{Q_j})_{j=1,…,c}, where each τ^{Q_j} is an n-dimensional vector (τ^{Q_j}_1, …, τ^{Q_j}_n)
Result: e, index of the selected partition among the c proposed
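Since the body of Algorithm 5 is not reproduced here, the selection rule described in the text can be sketched as follows. This is one possible reading of the prose, not the authors' pseudocode: the candidates are the clusters that do not worsen τ on mode k, the current cluster b remains an option only when no candidate strictly improves τ on mode k, and the winner is the candidate with the highest avg(τ), with ties broken by τ on mode k. The argument `taus` (a list of c vectors, one per candidate partition) mirrors the algorithm's input; everything else is an assumption.

```python
def select_best_partition_alt2(k, b, taus):
    """Return the index e of the selected candidate partition (sketch).

    taus[e] is the n-dimensional tau vector of candidate Q_e;
    taus[b] corresponds to leaving the object in its current cluster.
    """
    def avg(v):
        return sum(v) / len(v)

    current = taus[b][k]
    others = [e for e in range(len(taus)) if e != b]
    # Keep only the moves that do not worsen tau on mode k.
    candidates = [e for e in others if taus[e][k] >= current]
    # If nothing strictly improves tau on mode k, staying put is allowed;
    # b is listed first so that it wins exact ties.
    if not any(taus[e][k] > current for e in candidates):
        candidates = [b] + candidates
    # Highest avg(tau) wins; ties are broken by tau on mode k.
    return max(candidates, key=lambda e: (avg(taus[e]), taus[e][k]))

# Candidate 1 has the largest tau on mode 0, but candidate 2 has the best average.
taus = [[0.5, 0.5], [0.6, 0.1], [0.55, 0.4]]
print(select_best_partition_alt2(0, 0, taus))
```

In this toy case a pure maximization of τ on mode 0 would pick candidate 1, while the rule above prefers candidate 2, even though its avg(τ) is lower than that of the current partition.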

Alternative aggregate optimization of τ
Finally, we propose a criterion (named SelectBestPartition AGG2) based on the same selection strategy as the previous one, but applied to function τ_{X_k|(X_j)_{j≠k}} + τ_{(X_j)_{j≠k}|X_k}. In more detail, the algorithm considers only the moves which improve τ_{X_k} and, among them, chooses the one with the highest value of τ_{X_k|(X_j)_{j≠k}} + τ_{(X_j)_{j≠k}|X_k}. The strategy is described in Algorithm 6.
Algorithm 6: SelectBestPartition AGG2 (k, b, (τ^{Q_j})_{j=1,…,c})
Input: The mode k of the tensor, the initial cluster b of the selected object,
Result: e, index of the selected partition among the c proposed

In the remainder of the paper, we refer to the five selection strategies as ALT (for Algorithm 2), AVG (for Algorithm 3), AGG (for Algorithm 4), ALT2 (for Algorithm 5), and AGG2 (for Algorithm 6).

Local convergence of TCC
A partition P is locally optimal with respect to a set of neighboring solutions N(P) if P is not dominated by any other solution Q ∈ N(P). In Ienco et al. (2013) the authors show that their matrix co-clustering algorithm, based on the multi-objective optimization of τ, converges to a Pareto local optimum with respect to the neighboring function N_{k,b,x}. Although the same property holds for TCC as well, here we prove a slightly stronger local convergence property for three strategies, namely ALT, AVG and ALT2.
Theorem 1 If TCC (with selection strategy ALT, AVG or ALT2) ends within t < N_iter iterations, then it returns a Pareto local optimum with respect to the neighboring function N, which considers, as neighboring partitions of P, all those differing from P in the cluster assignment of a unique element x in a unique mode k.
We demonstrate the property for every selection strategy.
-AVG. When the algorithm ends in less than N_iter iterations, all objects on all modes have been considered for a move, but no move has actually been performed: this means that any co-clustering Q obtainable by moving one single element in one single mode has avg(τ^Q) ≤ avg(τ^P). This implies that P is not dominated by Q, for any Q ∈ N(P), i.e., it is a Pareto local optimum w.r.t. N.
-ALT. Let Q ∈ N(P). Q differs from P in the cluster assignment of a unique element x on mode k. Object x has been selected by the algorithm in one of the last max_{i=1,…,n}(m_i) iterations, but no move has been done: it means that either τ^P_k > τ^Q_k, or τ^P_k = τ^Q_k and avg(τ^P) ≥ avg(τ^Q) (because ties are solved in favor of the partition with the highest avg(τ)). In both cases P ⊀ Q. Thus P is a Pareto local optimum w.r.t. N.
-ALT2. The proof is identical to the ALT case. ◻

While the convergence to a local optimum w.r.t. neighboring function N_{k,b,x} is always guaranteed, the convergence w.r.t. neighboring function N can be proved only when the algorithm ends within t < N_iter iterations. As a rule of thumb, we suggest setting N_iter equal to ten times the sum of the dimensions on all the modes of the tensor. According to our experiments this is a "safe" threshold: although there is no theoretical proof that the algorithm will reach convergence within this number of iterations, it always does so in our experiments, and with a large margin of tolerance (see Sect. 5.2).
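The notions used in the proof can be illustrated with a few lines of code. The sketch below assumes the usual definition of Pareto dominance (no component worse, at least one strictly better), which is introduced earlier in the paper and not repeated in this section; the helper names are hypothetical. It also encodes the rule of thumb for N_iter stated above.

```python
def dominates(tau_q, tau_p):
    """True if tau_q Pareto-dominates tau_p (assumed standard definition)."""
    no_worse = all(q >= p for q, p in zip(tau_q, tau_p))
    strictly_better = any(q > p for q, p in zip(tau_q, tau_p))
    return no_worse and strictly_better

def is_pareto_local_optimum(tau_p, neighbor_taus):
    """P is locally optimal if no neighboring partition dominates it."""
    return not any(dominates(tau_q, tau_p) for tau_q in neighbor_taus)

def default_n_iter(shape):
    """Rule of thumb from the text: ten times the sum of the mode dimensions."""
    return 10 * sum(shape)

tau_p = [0.5, 0.5]
neighbors = [[0.6, 0.3], [0.4, 0.6]]      # each has avg(tau) <= avg(tau_p)
print(is_pareto_local_optimum(tau_p, neighbors))
print(default_n_iter((100, 100, 20)))
```

The example mirrors the AVG argument: a neighbor whose average is not larger than that of P cannot be at least as good on every component and strictly better on one, so it cannot dominate P.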

Optimized computation of τ
In step 19 of Algorithm 1, for a fixed mode k, the following quantities are computed: τ^{Q_e}_{X_j} for each j = 1, …, n and each e = 1, …, c_k, where c_k is the number of clusters on mode k (including the empty set) and Q_e is the co-clustering obtained by moving an object x from cluster C^k_b to cluster C^k_e. A way to compute these quantities is to fix an arrival cluster C^k_e, move x to C^k_e obtaining partition Q_e, compute the contingency tensor T_e associated to that partition (using Eq. 2) and compute the vector τ^e associated to tensor T_e (using Eq. 3 for strategies ALT, ALT2, and AVG and, additionally, Eq. 4 for AGG and AGG2). By repeating these steps for every e ∈ {1, …, c_k}, we obtain a matrix V = (τ_{X_j}(T_e))_{ej} of shape c_k × n. If the variant of the algorithm is AGG or AGG2, matrix V has shape c_k × 2, because instead of computing all the τ_{X_j} we only compute τ_{X_k} and τ_{(X_j)_{j≠k}|X_k}, where k is the mode considered at that moment. Then we pass matrix V as input to one of the variants of function SelectBestPartition, which determines where to move x. In order to obtain V more efficiently, we can reduce the amount of calculations by only computing the variation of τ^e from one step to another. We take advantage of the fact that a large part of the formula remains the same when moving a single element from one cluster to another. Hence, an important part of the computation of τ can be saved.
Imagine that x has been selected in cluster C^1_b and that we want to move it to cluster C^1_e (for simplicity we consider x on the first mode, but all the computations below are analogous on any other mode k). Object x is a row on the first mode (let's say the j-th row) of tensor X, and so x can be expressed as a tensor 𝓜 ∈ ℝ_+^{m_2×⋯×m_n} with n − 1 modes, whose generic entry is μ_{k_2…k_n} = x_{j k_2…k_n}. We will denote with M the sum of all elements of 𝓜. Let T and τ(T) be the tensor and the measure associated to the initial co-clustering, and S and τ(S) the tensor and the measure associated to the final co-clustering obtained after the move. Tensor S differs from T only in those entries having index k_1 ∈ {b, e}. In particular, for each k_i = 1, …, c_i and i = 2, …, n:

s_{b k_2…k_n} = t_{b k_2…k_n} − μ_{k_2…k_n}
s_{e k_2…k_n} = t_{e k_2…k_n} + μ_{k_2…k_n}
s_{k_1 k_2…k_n} = t_{k_1 k_2…k_n}, if k_1 ∉ {b, e}.

Replacing these values in Eq. 1, we can compute the variation of τ_{X_1} when moving object x from cluster C^1_b to cluster C^1_e. Part of the terms in the resulting expression only depend on T and can then be computed once (before choosing b and e). Thanks to this approach, τ_{X_1} does not have to be recomputed from scratch for every candidate move, and the operations that depend only on T are executed only once for each mode in each iteration. In a similar way, we can compute the variation of τ_{X_j} for any j ≠ 1 (this computation is needed only when variants AVG and ALT2 are used): here too, part of the expression only depends on T and can be computed once for all e. Consequently, τ_{X_j} does not have to be recomputed m_i times from scratch in Algorithm 1, an operation whose cost is highest in the worst case of the discrete partition. The part depending only on T is computed in O(m_j) and only once for each mode in each iteration.

Similarly, when using variants AGG and AGG2, instead of calculating τ_{(X_j)_{j≠k}|X_k} entirely, we can compute its variation, part of which only depends on T and can be computed once for all e. Thus, τ_{(X_j)_{j≠k}|X_k} does not have to be computed m_i times from scratch, and the part depending only on T is computed only once for each mode in each iteration.
Hence, at each iteration and for each mode k, matrix V = (τ_{X_j}(T_e))_{ej} does not need to be recomputed from scratch for every candidate cluster. Based on the above considerations, for a generic square tensor with n modes, each consisting of m dimensions, the overall complexity is in O(I n m^n) for strategies ALT, AGG and AGG2 and in O(I n^2 m^n) for strategies AVG and ALT2, where I is the number of iterations. This difference is due to the fact that the first group of strategies requires the computation of just a fixed number of τ's for each mode (one in the ALT case, two in the AGG and AGG2 cases), independently of the number of modes n, while ALT2 and AVG require the computation of all the n τ's for each mode. The computational complexity of the two groups of methods differs by a factor of O(n): this could be a discriminating factor in the choice of the method only for tensors with a large number of modes.
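The update of the contingency tensor that underlies this optimization can be checked numerically. The sketch below assumes that the contingency tensor simply sums the entries of X falling in each co-cluster, and that the contribution of the moved object is aggregated over the clusters of the other modes before being subtracted from cluster b and added to cluster e; function names and the toy data are illustrative only.

```python
import numpy as np

def contingency_tensor(X, labels, n_clusters):
    """Sum the entries of X inside each co-cluster (computed from scratch)."""
    T = np.asarray(X, dtype=float)
    for mode, (lab, c) in enumerate(zip(labels, n_clusters)):
        onehot = np.zeros((c, len(lab)))          # cluster membership matrix
        onehot[lab, np.arange(len(lab))] = 1.0
        # Contract the object axis of this mode with the membership matrix.
        T = np.moveaxis(np.tensordot(onehot, T, axes=(1, mode)), 0, mode)
    return T

def move_row(T, X, labels, n_clusters, j, b, e):
    """Contingency tensor after moving object j (first mode) from cluster b to e."""
    # Contribution of object j, aggregated over the clusters of the other modes.
    mu = contingency_tensor(X[j], labels[1:], n_clusters[1:])
    S = T.copy()
    S[b] -= mu      # entries with first index b lose the object
    S[e] += mu      # entries with first index e gain it
    return S        # all other entries are unchanged

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 5, 4))
labels = [np.array([0, 0, 1, 1, 2, 2]), np.array([0, 1, 0, 1, 0]), np.array([0, 0, 1, 1])]
n_clusters = [3, 2, 2]

T = contingency_tensor(X, labels, n_clusters)
S = move_row(T, X, labels, n_clusters, j=0, b=0, e=2)

new_labels = [labels[0].copy(), labels[1], labels[2]]
new_labels[0][0] = 2
print(np.allclose(S, contingency_tensor(X, new_labels, n_clusters)))
```

Only two slices of the tensor are touched by a move, which is why the terms of τ that depend on the untouched entries can be cached across candidate clusters.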

Experiments
In this section, we report the results of the experiments we conducted to evaluate the performance of our tensor co-clustering algorithm. The section is organized as follows: first, we describe both the synthetic and real-world datasets used in our experiments; second, we compare the different variants of our algorithm by also analyzing their convergence behavior; third, we report the quantitative results of the comparative analysis between our algorithm and some state-of-the-art competitors; finally, we provide some qualitative insights on the co-clusters obtained in one specific case.

Datasets
The synthetic data we use to assess the quality of the clustering performance are boolean tensors with n modes, created as follows. We fix the dimensions m_1, …, m_n of the tensor and the number of embedded clusters c_1, …, c_n on each mode. Then, we first construct a block tensor of dimensions m_1 × m_2 × ⋯ × m_n with c_1 × c_2 × ⋯ × c_n blocks. The blocks are created so that there are "perfect" clusters in each mode, i.e., all rows on each mode belonging to the same cluster are identical, while rows in different clusters are different. Then we add noise to the "perfect" tensor, by randomly selecting some elements t_{k_1…k_n}, with k_i ∈ {1, …, m_i} for each i ∈ {1, …, n}, and changing their value (from 0 to 1 or vice versa). The amount of noise is controlled by a parameter in [0, 1], indicating the fraction of elements of the original tensor we change. We generate tensors with different numbers of modes, sizes, numbers of clusters and noise levels (from 0.05 to 0.3 with a step of 0.05).
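A generator in the spirit of this procedure could look as follows. It is a simplified sketch, not the authors' code: clusters are assumed to be balanced and contiguous, block values are drawn at random (so distinct clusters are not guaranteed to differ, unlike in the paper), and the function name and signature are hypothetical.

```python
import numpy as np

def make_block_tensor(shape, n_clusters, noise, seed=0):
    """Boolean block tensor with a given fraction of flipped entries (sketch)."""
    rng = np.random.default_rng(seed)
    # Assumption: balanced, contiguous clusters on every mode.
    labels = [np.arange(m) * c // m for m, c in zip(shape, n_clusters)]
    # One 0/1 value per block; random here for brevity.
    blocks = rng.integers(0, 2, size=n_clusters)
    perfect = blocks[np.ix_(*labels)]
    # Flip a fraction `noise` of the entries, chosen without replacement.
    n_flip = int(round(noise * perfect.size))
    flat = perfect.ravel().copy()
    idx = rng.choice(flat.size, size=n_flip, replace=False)
    flat[idx] = 1 - flat[idx]
    return flat.reshape(shape), perfect, labels

noisy, perfect, labels = make_block_tensor((20, 10, 6), (4, 2, 3), noise=0.15)
print(noisy.shape, int((noisy != perfect).sum()))
```

The returned `labels` play the role of the ground truth against which NMI and ARI are computed on each mode.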
We also apply the algorithms to three real-world datasets (see Table 1). The first dataset is the "four-area" DBLP dataset 1. It is a bibliographic information network extracted from DBLP data, downloaded in the year 2008. The dataset includes all papers published in twenty representative conferences of four research areas (database, data mining, machine learning and information retrieval), five in each area. Each element of the dataset corresponds to a paper and contains the following information: authors, venue and terms in the title. The original dataset contains 14376 papers, 14475 authors and 13571 terms. Part of the authors (4057) are labeled in four classes, roughly corresponding to the four research areas. We select only these authors and their papers and perform stemming and stop-word removal on the terms by using the functions provided by the NLTK Python library 2 (in particular, we use the Porter stemmer). We obtain a dataset with 14328 papers, from which we create a highly sparse (6044 × 4057 × 20)-dimensional tensor (99.98% of entries are equal to zero); the generic entry t_{ijk} of the tensor counts the number of times term i was used by author j in conference k.
The second dataset is the "hetrec2011-movielens-2k" dataset 3 published by Cantador et al. (2011). It is an extension of the MovieLens10M dataset, published by the GroupLens research group 4. It links the movies of the MovieLens dataset with their corresponding web pages at the Internet Movie Database (IMDb 5) and the Rotten Tomatoes movie review system 6. From the original dataset, only those users with both rating and tagging information are retained, for a total of 2113 users, 10197 movies (classified in 20 overlapping genres) and 13222 tags. Then, we select the users that have tagged at least two different movies, the tags that have been used for at least ten different movies, and the movies that received a tag from at least five different users. Finally, starting from the remaining data, we create two different tensors:
-MovieLens1: it includes all the movies classified as 'Animation', 'Documentary', or 'Horror'. Since we need unique labels to assess the quality of our co-clustering algorithm, we keep only the movies with a unique genre label. At the end we obtain a (215 × 181 × 142)-dimensional tensor. The classes on the movie mode are quite imbalanced: there are only 11 documentaries, while the remaining movies are divided into Animation (63) […]

The last dataset is a subset of the Yelp dataset 7, which collects Yelp's businesses, reviews, and user data. Among all available data, we select only the reviews about Italian, Mexican and Chinese restaurants with at least ten reviews and the users who wrote at least five reviews. Finally, we pre-process the text of the remaining reviews by performing both stemming and stop-word removal and by retaining the words appearing at least 5 times in at least one category of restaurants, plus the 150 most frequent words (regardless of the category).
At the end, we obtain two different tensors, with restaurants on the first mode, users on the second mode, and words used in the reviews on the third mode:
-yelpTOR: it includes the restaurants of the city of Toronto. The final tensor has shape (626, 178, 458) and contains 1885 reviews about 234 Italian restaurants, 288 Chinese restaurants and 104 Mexican restaurants. We consider the type of restaurant (Italian, Chinese or Mexican) as the label on the first mode.
-yelpPGH: it includes the restaurants of the city of Pittsburgh. The final tensor has shape (237, 95, 544), containing data about 104 Italian restaurants, 64 Chinese and 63 Mexican restaurants.

Comparison of the different variants of TCC
Fig. 3 Avg(τ) per iteration, for all the TCC variants, on synthetic data, varying the shape of the tensor and the number of embedded co-clusters. Panels: (a) 100×100×100 (5, 5, 5); (b) 1000×100×20 (5, 5, 5); (c) 100×100×100 (5, 3, 2); (d) 1000×100×20 (5, 3, 2); (e) 100×100×100 (10, 5, 3); (f) 1000×100×20 (10, 5, 3); (g) 100×100×100×100 (10, 5, 3, 2); (h) 1000×100×20×20 (10, 5, 3, 2)

In this section we apply the five different versions of TCC (ALT, AVG, ALT2, AGG, AGG2) to synthetic and real-world data with the aim of comparing their overall performances and convergence behavior. We first apply the algorithms on synthetic data, varying the number of modes and the shape of the tensor (100 × 100 × 100, 1000 × 100 × 20, 100 × 100 × 100 × 100, and 1000 × 100 × 20 × 20) and the number of embedded clusters on each mode ((5,5,5), (5,3,2), (10,5,3), and (10,5,3,2)), with a medium level of noise of 0.15. As shown in Fig. 3, the AVG variant of TCC provides less accurate results than the other variants: in cubic tensors (tensors with the same dimensionality on all modes) all the methods achieve similar levels of avg(τ), but AVG requires more iterations. On asymmetric tensors, AVG ends in a solution with lower avg(τ) compared with the one obtained by the other methods. The algorithms have a similar behavior on real-world tensors (Fig. 4): the one with the overall best results in terms of avg(τ) is ALT2, followed by AGG2 and ALT. It is worth noting that avg(τ) grows even with variants that do not optimize it directly. In real-world data, although avg(τ) decreases during the very first iterations, it then begins to grow monotonically, with the exception of some small dips. Of course, the AVG variant is not affected by this behavior, since it optimizes avg(τ) directly. Moreover, as anticipated in Sect. 4.2, the direct optimization of avg(τ) often results in a relatively poor local optimum.
This is because a relaxed constraint on the neighborhood search allows the algorithm to explore more solution subspaces, thus ending up with a better final objective function value. After these preliminary experiments we conclude that ALT2 seems to be the most effective method, with ALT2, AGG2 and ALT outperforming the other two variants of TCC. We also conduct a Friedman statistical test followed by a Nemenyi post-hoc test (Demsar 2006) in order to assess whether the differences among the best three variants are statistically significant. At significance level α = 0.01, the null hypothesis of the Friedman test (stating that the differences are not statistically significant) can be rejected for avg(τ) values; we then proceed with the post-hoc Nemenyi test. The results show that the differences between the average rank of ALT2 and those of the other methods exceed the critical difference CD = 0.19168 at significance level α = 0.01. Consequently, the null hypothesis of the Nemenyi test is rejected, and we can conclude that ALT2 is statistically better than AGG2 and ALT. Nevertheless, hereinafter, we will consider all three best-performing variants in our experiments, while we will not report the results for AVG and AGG.
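For reference, the critical difference of the Nemenyi test is usually computed as in Demšar's formulation below, where k is the number of compared methods (here, the three best variants), N is the number of datasets over which the ranks are averaged, and q_α is the critical value based on the Studentized range statistic. The values of N and q_α behind the reported CD are not stated in this section, so they are left symbolic.

```latex
\[
  \mathrm{CD} \;=\; q_{\alpha}\,\sqrt{\frac{k\,(k+1)}{6\,N}}
\]
% Two methods are considered significantly different when the absolute
% difference between their average ranks exceeds CD.
```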

TCC against state-of-the-art competitors
In this section, we compare our results with those of other state-of-the-art tensor co-clustering algorithms, mainly based on CP (Harshman 1970) and Tucker (Tucker 1966) decomposition. Additionally, we include another very recent approach based on the latent block model. Hence, we consider the following algorithms:
-nnCP. It is the non-negative CP decomposition and can be used to co-cluster a tensor, as done by Zhou et al. (2009), by assigning each element in each mode to the cluster corresponding to the latent factor with the highest value. The algorithm requires as input the number r of latent factors: we set r = max_{j=1,…,n}(c_j), where c_j is the true number of classes on the j-th mode of the tensor. Since the CP model represents the tensor as the sum of r rank-1 decompositions, the number r of latent factors is the same on all modes. However, the rank r of the decomposition represents the maximum number of clusters that can be found on each mode of the tensor, thus the fact that we specify the same number r of latent factors on all the modes does not force the algorithm to identify exactly r clusters on each mode. This is particularly important when the number of embedded clusters c_j differs along the modes, because the algorithm is allowed to identify a number of clusters that is less than the maximum number r.
-nnCP+kmeans. It combines CP with a post-processing phase in which k-means is applied on each of the latent factor matrices. Here, we set the rank r to r = max_{j=1,…,n}(c_j) + 1 and the number k_i of clusters in each dimension equal to the real number of classes (according to our experiments, this is the choice that maximizes the performances of this algorithm).
-nnTucker. It is the non-negative Tucker decomposition. Here we set the ranks of the core tensor equal to (c_1, …, c_n).
-nnT+kmeans. It combines Tucker decomposition with k-means on the latent factor matrices, similarly to what has been done by Huang et al. (2008) and Cao et al. (2015).
-SparseCP. It consists of a CP decomposition with non-negative sparse latent factors (Papalexakis et al. 2013). We set the rank r of the decomposition equal to the maximum number of classes on the n modes of the tensor. It also requires one parameter for each mode of the tensor: for the choice of their values we follow the instructions given in the original paper.
-TBM. It performs tensor co-clustering via the Tensor Block Model (Wang and Zeng 2019). As parameters, it requires the number of clusters on each mode and a penalty coefficient; the number of clusters is set equal to the correct number of classes, while the penalty coefficient is tuned as suggested in the original paper.
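The cluster assignment used for the factorization-based baselines (each element goes to the latent factor with the highest value) can be sketched in a few lines. The factor matrices below are made up for illustration and the function name is hypothetical; computing the decomposition itself is outside the scope of this sketch.

```python
import numpy as np

def clusters_from_factors(factors):
    """Assign each object to its dominant latent factor, mode by mode.

    factors: list of non-negative matrices, one per mode, each of shape (m_i, r).
    """
    return [np.argmax(A, axis=1) for A in factors]

# Toy factor matrix for one mode with r = 3 latent factors.
A = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.0],
              [0.1, 0.7, 0.0]])
labels = clusters_from_factors([A])[0]
print(labels, len(np.unique(labels)))
```

The third factor is never dominant in this example, so only two clusters are found even though r = 3: this is the sense in which r acts as an upper bound on the number of clusters per mode.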
The available codes of SparseCP and TBM only work with 3-way tensors, so we have to exclude these methods when we perform experiments on tensors with more than three modes.
To assess the quality of the clustering performance, we consider two measures commonly used in the clustering literature: normalized mutual information (NMI) (Strehl and Ghosh 2002) and adjusted Rand index (ARI) (Hubert and Arabie 1985).
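As an illustration of how such an external measure is computed, the following is a minimal standard-library sketch of the adjusted Rand index in its usual pair-counting form. It is given for clarity only and is not the implementation used in the experiments; the function name is our own.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same objects."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Pairs of objects that are together in both labelings.
    together_both = sum(comb(v, 2) for v in Counter(zip(labels_true, labels_pred)).values())
    # Pairs together in the true labeling / in the predicted labeling.
    together_true = sum(comb(v, 2) for v in Counter(labels_true).values())
    together_pred = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    max_index = (together_true + together_pred) / 2
    if max_index == expected:      # degenerate case, e.g. a single cluster in both
        return 1.0
    return (together_both - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index is invariant to a renaming of the cluster labels, which matters here because TCC does not know the class names and may return a different number of clusters than the number of classes.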
All experiments are performed on a server with 32 2.1GHz Intel Xeon Skylake cores, 256GB RAM, Ubuntu 20.04.02 LTS (kernel release: 5.8.0) 8 . In the following, we first present the comparative results obtained on synthetic data, then we report the performances achieved by our algorithms and the competitors on real-world data.

Results on synthetic data
We test the performances of TCC against those of its competitors on synthetic data with embedded block co-clusters, constructed as described in Sect. 5.1. We consider tensors with 3, 4 and 5 modes, with different shapes (100 × 100 × 20, 100 × 100 × 100, 1000 × 100 × 20 and 1000 × 500 × 20 for 3-way tensors, 100 × 100 × 100 × 100 and 1000 × 100 × 20 × 20 for 4-way tensors and 100 × 100 × 100 × 100 × 100 and 1000 × 100 × 20 × 20 × 20 for 5-way tensors), different numbers and shapes of block co-clusters (combinations of 2, 3, 5 and 10 clusters on each mode) and with three levels of noise (0.1, 0.2 and 0.3), for a total of 276 tensors. We run all the experiments ten times and compute the average NMI and ARI. Figs. 5, 6, 7 and 8 report the results in terms of average NMI of all the experiments. The results in terms of mean ARI are similar and are presented in the appendix (see Figs. 13, 14, 15 and 16). For the sake of clarity, in these figures we include only the best variant of the algorithm according to our previous analysis (referred to as TCC_ALT2). We also omit the standard deviation of the experiments from the plots. However, the results are very stable: the standard deviation of TCC ranges from 0 to 0.001, while the algorithms with the highest variability are nnCP+kmeans and nnT+kmeans, whose standard deviation ranges from 0 to 0.004. In all the experiments our algorithm achieves nearly perfect levels of NMI and ARI (always greater than 0.93), meaning that it is able to identify the correct co-clusters embedded in the tensors. The shape and the number of modes of the tensor and the asymmetry in the number of clusters on the different modes do not significantly affect the quality of the co-clustering. Furthermore, TCC_ALT2 consistently outperforms SparseCP, TBM, nnTucker and nnCP on synthetic data.
Finally, the results of TCC_ALT2 are comparable with those of nnCP+kmeans and nnT+kmeans: only when the number of clusters differs across the three modes are the results of TCC_ALT2 slightly lower than those of the k-means-based algorithms. This is due to the fact that our algorithm fails to identify the correct number of clusters in these scenarios: for instance, when the clusters on the three modes are 10, 5 and 3 respectively, TCC_ALT2 identifies 9, 5 and 3 clusters. We do not have the same issue with k-means, for which, however, the correct number of clusters is given as input.

To further investigate this behavior, we analyze the results of the three variants of TCC on synthetic data (the detailed results are reported in the appendix, in Figs. 17, 18, 19 and 20). We find that, while all the variants of our algorithm find the correct clusters when the numbers of embedded clusters on all the modes are similar, the results degrade when we consider different numbers of clusters across the modes. This issue is more pronounced for ALT and AGG2, while ALT2 is able to find good or perfect clusters even in these scenarios (in particular, when m_i ≫ k_i for all i = 1, 2, 3, where m_i is the dimension of the tensor and k_i the number of clusters on mode i).

Execution time analysis
In this section, we show the execution times of TCC on tensors with different numbers of modes and shapes. We compare the execution times of TCC with those of its competitors; for the sake of clarity, we exclude nnT+kmeans and nnCP+kmeans from the experiment, since the execution time of the post-processing k-means algorithm is negligible w.r.t. the execution time of the decomposition. Firstly, we run the different algorithms on 3-way tensors of increasing magnitude, starting from a tensor of shape 100 × 100 × 10 and adding from 100 to 900 dimensions to the first mode, until reaching a tensor of shape 1000 × 100 × 10. Then we add from 100 to 900 dimensions to the second mode, until we reach a tensor of shape 1000 × 1000 × 10. All the tensors have 5 clusters on the larger modes and 2 on the smallest ones. Figure 9a shows that all the variants of TCC are slower than their competitors (with the exception of SparseCP), approximately by a factor of 10. This depends on the fact that the number of iterations until convergence is higher for TCC than for the other methods. However, the trends of the curves are similar for all methods, as expected by looking at the theoretical complexities reported in Table 2. In the second and third experiments (Figs. 9b and 9c) we start with the same 3-way tensor of shape 100 × 100 × 10 and then we increase the number of modes: in the experiment reported in Fig. 9b a new mode of dimension 10 is added at each step, while in the experiment reported in Fig. 9c, at each step, we add a new mode of dimensionality 100. The plots show that the difference in the execution time between TCC and the other methods (in particular nnCP) decreases as the number of modes increases.

Fig. 9 panels: (a) 3-way tensors, (b) n-way tensors, (c) n-way tensors

Results on real-world data
As a last experiment, we apply our algorithm and its competitors on real-world datasets. Each algorithm is applied ten times on every dataset and the average results and standard deviations are presented in Tables 3 and 4. Algorithms nnCP, nnTucker and their variants with k-means are applied with different parameters: we try different ranks of the decomposition (while k of k-means is fixed to the correct number of classes in the data) and we report the best result obtained. In this way we are giving a big advantage to our competitors: we choose the rank of the decomposition and the number of clusters by looking at the actual number of categories, which are unknown in standard unsupervised settings. Despite this, TCC (in all its variants) outperforms the other algorithms on all datasets but one (DBLP) and has comparable results on another (YelpTOR). As regards DBLP, non-negative Tucker decomposition (with the number of latent factors set to the correct number of embedded clusters) achieves the best results. Non-negative CP decomposition obtains results that are […] (Fig. 10), the results get immediately worse than those of TCC. A similar observation holds for YelpTOR: Tucker decomposition achieves the best performance (and only in terms of NMI) only when the number of latent factors equals the number of naturally embedded clusters. The number of clusters identified by TCC is usually close to the correct number of embedded clusters: on average, 5 instead of 4 for DBLP, 5 instead of 3 for MovieLens1, the correct number 3 for MovieLens2, 5 instead of 3 for YelpPGH. Only YelpTOR presents a number of clusters (13) that is far from the correct number of classes (3). However, more than 85% of the objects are classified in 3 large clusters, while the remaining objects form very small clusters: we consider these objects as candidate outliers.
The same behavior is even more pronounced in DBLP, where four clusters contain 99.9% of the objects and only 2 objects stay in the "extra cluster".

Qualitative evaluation of the results
Here, we provide some insights about the quality of the clusters identified by our algorithm. To this purpose, we choose a co-clustering of the MovieLens1 dataset, obtained with selection strategy AGG2. The results obtained by the other variants of TCC, however, are very similar both in the number and in the composition of the identified clusters.
When Algorithm 1 terminates, five clusters of movies are identified, instead of the three categories (Animation, Horror and Documentary) we consider as labels. The tag clouds in Fig. 11 illustrate the 30 most-tagged movies for each cluster (text size depends on the actual number of tags): it can be easily observed that the first cluster concerns animated movies for children, mainly Disney and Pixar movies; the second one is a small cluster containing animated movies realized with the claymation technique (mainly movies of the Wallace and Gromit saga or other films by the same director); the third cluster is still a subset of the animated movies, but it contains anime and animated films from Japan. The fourth cluster is composed mainly of horror movies and the last one contains only documentaries. On the tag mode, our algorithm finds thirteen clusters. Six of them contain more than 90% of the total tags, and only 10 uninformative tags are partitioned into the other 7 very small clusters and could be considered as outliers. There is a one-to-one correspondence between four clusters of movies (Cartoons, Anime, Wallace&Gromit and Documentary) and four of the tag clusters; cluster Horror, instead, can be put in relation with two different tag clusters, the first containing names of directors, actors or characters of popular horror movies, the second composed of adjectives typically used to describe disturbing films. For more details, see Fig. 12. In a few cases, the cluster of a movie does not coincide with its category label: for instance, Tim Burton's movies The Nightmare Before Christmas and Corpse Bride, which are labeled as "Animation" in the original dataset, have been included in the horror cluster by the TCC algorithm.
These movies, indeed, have more similarities with non-animated horror movies than with cartoons for children, and our co-clustering algorithm was able to capture that (even if they have also been given tags such as "animated" and "claymation" that are typical of the first two clusters). This is probably due to the fact that TCC takes advantage of the tensor structure of the data, having the opportunity to look at both the tag and user modes when partitioning the movies: besides being tagged with the same words, similar films are also appreciated by the same kind of users. Unfortunately, we do not have any latent class information about the users.
To better understand how much the tensor structure helps to find better clusters on the main mode, we execute a further experiment: we try TCC on two 2-way tensors, T1 having movies and users as modes, and T2 with movies and tags as modes. Each cell of these matrices counts the number of times a movie has been tagged by a particular user (in T1) and the number of times a movie received a particular tag (in T2).

Table 5 Comparison of the results obtained on MovieLens1 dataset with 3D-TCC (on movie-user-tag tensor), 2D-TCC T1 (on movie-user matrix), 2D-TCC T2 (on movie-tag matrix) and CoStar on both T1 and T2 matrices simultaneously

We apply the TCC algorithm on the two matrices independently and, finally, on the two matrices simultaneously, using the 2-way co-clustering algorithm for multi-view data based on the optimization of τ (CoStar) proposed by Ienco et al. (2013). The results are summarized in Table 5: they clearly show that the quality of the results (in terms of both NMI and ARI) is higher for the 3-way version of the algorithm than for the 2-way versions. Considering multiple views helps, but not to a great extent. These results suggest that movie clustering benefits from the tensorial structure of the data, drawing information not only from the movie-user or movie-tag relationships but also from the user-tag relationship.
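The relation between the 3-way tensor and the two matrices can be sketched as follows. The mode order (movie, user, tag) and the binary encoding of tag assignments are assumptions made for the example, since the section does not state how the MovieLens tensors are laid out; the function name is hypothetical.

```python
import numpy as np

def two_way_views(T3):
    """Collapse a (movie, user, tag) tensor into the two matrices T1 and T2."""
    T1 = T3.sum(axis=2)   # movie x user: how many tags the user gave to the movie
    T2 = T3.sum(axis=1)   # movie x tag: how many users gave that tag to the movie
    return T1, T2

# Toy tensor: 2 movies, 2 users, 3 tags.
T3 = np.zeros((2, 2, 3), dtype=int)
T3[0, 0, 0] = 1    # user 0 tags movie 0 with tag 0
T3[0, 0, 1] = 1    # user 0 tags movie 0 with tag 1
T3[0, 1, 0] = 1    # user 1 tags movie 0 with tag 0
T3[1, 1, 2] = 1    # user 1 tags movie 1 with tag 2

T1, T2 = two_way_views(T3)
print(T1.tolist(), T2.tolist())
```

Neither matrix retains which user applied which tag; that user-tag information is only available in the 3-way tensor, which is consistent with the interpretation given above.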

Conclusions
The majority of tensor co-clustering algorithms optimize objective functions that strongly depend on the number of co-clusters. This limits the correct application of such algorithms in realistic unsupervised scenarios. To address this limitation, we have introduced a new co-clustering algorithm specifically designed for tensors that does not require the desired number of clusters as input. We have proposed different variants of the algorithm, showing their theoretical and/or experimental convergence properties.
Our experimental validation has shown that our approach outperforms state-of-the-art methods on most datasets. Even when our algorithms are not the best ones, we have found that the competitors cannot work properly without the correct number of clusters being specified for each mode of the tensor. As future work, we will design a specific algorithm for sparse tensors with the aim of reducing the overall computational complexity of the approach. Finally, we will further investigate the ability of our method to identify candidate outliers as small clusters in the data.