Auto-weighted multiple kernel tensor clustering

Multiple kernel subspace clustering (MKSC) has attracted intensive attention due to its powerful capability of exploring consensus information by generating a high-quality affinity graph from multiple base kernels. However, existing MKSC methods still suffer from the following limitations: (1) they essentially neglect the high-order correlations hidden in different base kernels; and (2) they perform candidate affinity graph learning and consensus affinity graph learning in two separate steps, which may yield a suboptimal solution. To alleviate these problems, a novel MKSC method, namely auto-weighted multiple kernel tensor clustering (AMKTC), is proposed. Specifically, AMKTC first integrates consensus affinity graph learning and candidate affinity graph learning into a unified framework, where the optimal goal is achieved by letting the two learning processes negotiate with each other. Further, a one-step auto-weighted fusion scheme is proposed to learn the final consensus affinity graph, where reasonable weights are automatically learned for each candidate graph. Finally, the essential high-order correlations between multiple base kernels are captured by imposing a tensor-singular value decomposition (t-SVD)-based tensor nuclear norm constraint on a 3-order graph tensor. Experiments on seven benchmark datasets with eleven comparison methods demonstrate that our method achieves state-of-the-art clustering performance.


Introduction
Multiple kernel clustering (MKC) aims to optimally integrate the consensus information among multiple base kernels to generate a consensus kernel that improves clustering performance. In the last few years, MKC methods have been widely applied in various applications, benefiting from an expression ability for handling non-linear data that is more powerful than that of traditional single kernel clustering methods [1][2][3][4][5]. In general, existing MKC methods fall into two main classes, i.e., linear kernel fusion (LKF) scheme-based methods and non-linear kernel fusion (NKF) scheme-based methods.
Usually, LKF assumes that a linear combination of multiple pre-defined base kernels can yield an optimal kernel. Motivated by this, SCMK [6] uses a linear combination of base kernels to learn an optimal kernel. Furthermore, LKGr [7] imposes an extra low-rank constraint on the base kernels to capture their structure information by considering the low-rank property of the samples. Recently, many efforts have been devoted to improving the effectiveness of LKF by leveraging other non-linear kernel fusion schemes in the kernel matrix space. For instance, the neighbor-kernel-based MKC [2] method non-linearly constructs multiple neighbor kernels in preparation for yielding a consensus kernel by considering the neighborhood structure among base kernels. Nevertheless, in practical applications, LKF tends to over-reduce the feasible set of the consensus kernel. That is, the representation capability of the consensus kernel fails to outperform a single kernel in some cases [8]. Therefore, the aforementioned MKC methods with the LKF scheme will not always be effective in handling non-linear clustering tasks.
Different from LKF, NKF is developed to better handle non-linear clustering, and it involves two intuitive assumptions: (1) the consensus kernel lies in a neighborhood of the candidate kernels; (2) larger weights are assigned to more important/similar candidate kernels, and vice versa. Hence, the flexibility and reliability of the consensus kernel are significantly enhanced. Recently, NKF has given rise to a number of effective MKC methods [8][9][10][11]. For instance, SMKL [8] indirectly yields an affinity graph for clustering by using NKF. JMKSC [9] integrates correntropy-based NKF and a block diagonal constraint to learn an affinity graph with an optimal block diagonal property. Considering the suboptimal low-rank constraint in LKGr [7], LLMKL [10] uses a low-rank substitute of the consensus kernel to upgrade the affinity graph. SPMKC [11] captures the global and local structure beyond SMKL [8] to control the similarity between the consensus kernel and the affinity graph, and improves the clustering performance significantly. Overall, the aforementioned LKF- and NKF-based MKC methods are referred to as multiple kernel subspace clustering (MKSC) [2,[6][7][8][9][10][11] methods, which typically work as follows: (1) generate multiple base kernels from a single dataset; (2) fuse these kernels with an LKF or NKF scheme to obtain a consensus kernel at the matrix level; (3) apply self-expressiveness subspace learning on the learned consensus kernel to adaptively learn an affinity graph for spectral clustering.
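The three-step MKSC pipeline above can be sketched as follows. The uniform averaging in step 2 is the simplest LKF scheme, and the ridge-regularized self-expressive solver in step 3 is an illustrative closed form rather than the exact model of any cited method:

```python
import numpy as np

def rbf_kernel(X, gamma):
    """RBF kernel matrix for row-wise samples X (n x d)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def mksc_pipeline(X, gammas, alpha=1.0):
    # Step 1: generate multiple base kernels from a single dataset.
    kernels = [rbf_kernel(X, g) for g in gammas]
    # Step 2: LKF-style fusion -- a uniform linear combination of base kernels.
    H = sum(kernels) / len(kernels)
    # Step 3: kernel self-expressiveness with a Frobenius regularizer has the
    # closed form S = (H + alpha*I)^(-1) H; symmetrize to get the affinity graph.
    n = H.shape[0]
    S = np.linalg.solve(H + alpha * np.eye(n), H)
    return (np.abs(S) + np.abs(S.T)) / 2.0  # affinity graph for spectral clustering
```

The returned symmetric affinity graph would then be fed to a standard spectral clustering routine.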
Although effective, these methods usually concentrate more on learning the consensus kernel than on the affinity graph, which violates the intention of MKSC, i.e., the ultimate goal of MKSC is to learn an optimal affinity graph [3] for clustering. Therefore, CAGL [3] innovatively learns multiple candidate affinity graphs from multiple base kernels first, and then fuses these candidate graphs to directly learn a consensus graph at the matrix level. Note that CAGL adopts a two-step way, learning the candidate graphs and the consensus graph separately. Although CAGL has achieved great progress compared with other state-of-the-art MKSC methods, it still suffers from the following flaws: (1) it essentially ignores the high-order correlations hidden in different base kernels, such that the consistent and complementary information of the given multiple kernels may not be fully explored; (2) it learns the weights of multiple candidate graphs at the matrix level rather than the tensor level, which lacks proper guidance and is sensitive to noise, leading to unreliable weights; and (3) it adopts a two-step way to learn the candidate affinity graphs and the consensus affinity graph separately, such that the obtained solution is usually inferior.
In light of the aforementioned limitations, a novel MKSC method, dubbed auto-weighted multiple kernel tensor clustering (AMKTC), is proposed to improve clustering. Concretely, by leveraging self-expressiveness subspace learning with multiple base kernels, multiple candidate affinity graphs are first learned. Then, we stack these graphs into a graph tensor in reproducing kernel Hilbert space, where the consistency and complementarity of the candidate graphs are fully exploited by imposing a tensor-singular value decomposition (t-SVD)-based tensor nuclear norm (t-TNN). Finally, with the high-quality candidate graphs at hand, an auto-weighted graph fusion scheme is developed to obtain an optimal consensus affinity graph under the guidance of the t-TNN constraint. The main contributions of this paper are summarized as follows: • This paper proposes a novel MKSC method, dubbed AMKTC, to effectively handle non-linear data clustering. AMKTC integrates consensus affinity graph learning and candidate affinity graph learning into a unified framework, where they are jointly learned in a mutually reinforcing manner. • This paper proposes to explore the high-order correlations of base kernels by imposing the t-TNN constraint on the rotated graph tensor. Meanwhile, auto-weighted graph fusion enforces the learning of the consensus affinity graph with optimal weights. • Compared with other state-of-the-art MKSC methods, the superiority of AMKTC is demonstrated through extensive experiments.
The remainder of the paper is organized as follows. The next section briefly reviews several works related to MKSC and tensors. The following section presents the AMKTC method and its solver. The experimental results are then reported, and the last section concludes the paper.

Preliminaries of tensor
In this section, to help better understand the tensor nuclear norm, some definitions about matrix and tensor decompositions are introduced below [12,13].
Definition 1 (Identity Tensor) The identity tensor $\mathcal{I} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ is the tensor whose first frontal slice is the $n_1 \times n_1$ identity matrix and whose other frontal slices are all zero.

Definition 2 (Orthogonal Tensor) A tensor $\mathcal{Q} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ is orthogonal if it satisfies $\mathcal{Q}^{\top} * \mathcal{Q} = \mathcal{Q} * \mathcal{Q}^{\top} = \mathcal{I}$, where $*$ denotes the t-product.
Definition 3 (Tensor Transpose) The transpose $\mathcal{B}^{\top} \in \mathbb{R}^{n_2 \times n_1 \times n_3}$ of a tensor $\mathcal{B} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is obtained by transposing each frontal slice of $\mathcal{B}$ and then reversing the order of the transposed frontal slices 2 through $n_3$.
Definition 4 (t-product) The t-product of $\mathcal{B} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ and $\mathcal{C} \in \mathbb{R}^{n_2 \times n_4 \times n_3}$ is the tensor $\mathcal{B} * \mathcal{C} = \mathrm{bvfold}(\mathrm{bcirc}(\mathcal{B})\,\mathrm{bvec}(\mathcal{C})) \in \mathbb{R}^{n_1 \times n_4 \times n_3}$, where the definitions of bcirc and bvec are as shown in "Notations".

Definition 5 (Tensor Singular Value Decomposition, t-SVD) The t-SVD of $\mathcal{B} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is defined as $\mathcal{B} = \mathcal{U} * \mathcal{S} * \mathcal{V}^{\top}$, where $\mathcal{U} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ and $\mathcal{V} \in \mathbb{R}^{n_2 \times n_2 \times n_3}$ are orthogonal tensors and $\mathcal{S} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is an f-diagonal tensor (each of its frontal slices is diagonal).

Definition 6 (t-SVD-based Tensor Nuclear Norm, t-TNN) The t-TNN of $\mathcal{B}$, denoted $\|\mathcal{B}\|_{\circledast}$, is the sum of the singular values of all the frontal slices of $\mathcal{B}_f$, i.e. $\|\mathcal{B}\|_{\circledast} = \sum_{i=1}^{n_3} \|\mathcal{B}_f^{(i)}\|_{*}$, where $\mathcal{B}_f$ is the result of the discrete Fourier transform of $\mathcal{B}$ along the third dimension and $\mathcal{B}_f^{(i)}$ is its $i$-th frontal slice.
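Under the conventions above, the t-product and t-SVD can be computed slice-wise in the Fourier domain. A minimal numpy sketch (the conjugate-symmetric slices are filled explicitly so the inverse FFT returns real tensors):

```python
import numpy as np

def t_product(A, B):
    """t-product A * B via frontal-slice products in the Fourier domain."""
    Af, Bf = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
    return np.real(np.fft.ifft(np.einsum('ijk,jlk->ilk', Af, Bf), axis=2))

def t_transpose(A):
    """Tensor transpose: transpose each frontal slice, reverse slices 2..n3."""
    At = np.transpose(A, (1, 0, 2)).copy()
    At[:, :, 1:] = At[:, :, :0:-1].copy()
    return At

def t_svd(B):
    """t-SVD B = U * S * V^T, computed per Fourier-domain frontal slice."""
    n1, n2, n3 = B.shape
    Bf = np.fft.fft(B, axis=2)
    Uf = np.zeros((n1, n1, n3), dtype=complex)
    Sf = np.zeros((n1, n2, n3), dtype=complex)
    Vf = np.zeros((n2, n2, n3), dtype=complex)
    for i in range(n3 // 2 + 1):
        M = Bf[:, :, i]
        if i == 0 or (n3 % 2 == 0 and i == n3 // 2):
            M = np.real(M)  # DC/Nyquist slices are real; keep factors real
        u, s, vh = np.linalg.svd(M)
        Uf[:, :, i] = u
        Sf[:len(s), :len(s), i] = np.diag(s)
        Vf[:, :, i] = vh.conj().T
    for i in range(n3 // 2 + 1, n3):  # fill by conjugate symmetry
        Uf[:, :, i] = Uf[:, :, n3 - i].conj()
        Sf[:, :, i] = Sf[:, :, n3 - i].conj()
        Vf[:, :, i] = Vf[:, :, n3 - i].conj()
    to_real = lambda T: np.real(np.fft.ifft(T, axis=2))
    return to_real(Uf), to_real(Sf), to_real(Vf)
```

The factorization can be verified by recomposing `t_product(t_product(U, S), t_transpose(V))`, which recovers the input tensor.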

Multiple kernel subspace clustering
Given a data matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $d$, $n$, and $x_i$ denote the sample dimensionality, the sample size, and the $i$-th sample, respectively, the self-expressiveness subspace learning (SESL) model [14][15][16][17][18] is formulated as

$\min_{S} \mathcal{L}(X, XS) + \alpha \mathcal{R}(S)$ (5)

where $\alpha > 0$ is a regularization parameter, $\mathcal{L}$ stands for the loss function, and $S$ is the desired coefficient matrix; the regularization term $\mathcal{R}(S)$ is commonly instantiated by sparse [19] or low-rank [20] constraints. Usually, the affinity graph can be calculated as $(S + S^{\top})/2$ for performing spectral clustering [21,22]. However, Eq. (5) cannot handle the non-linear data that exist extensively in practice. Consequently, Eq. (5) can be extended to kernel space by using a kernel mapping function $\phi(\cdot)$, i.e.
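With the Frobenius loss $\mathcal{L}(X, XS) = \|X - XS\|_F^2$ and the ridge regularizer $\mathcal{R}(S) = \|S\|_F^2$ (one common instantiation; the sparse and low-rank choices need iterative solvers), Eq. (5) has the closed-form minimizer $S = (X^{\top}X + \alpha I)^{-1} X^{\top}X$:

```python
import numpy as np

def sesl_frobenius(X, alpha):
    """Closed-form SESL: minimize ||X - X S||_F^2 + alpha ||S||_F^2 over S."""
    n = X.shape[1]             # X is d x n, columns are samples
    G = X.T @ X                # Gram matrix
    S = np.linalg.solve(G + alpha * np.eye(n), G)
    return S, (S + S.T) / 2.0  # coefficients and symmetrized affinity graph
```

Optimality can be checked by verifying that the gradient $-2X^{\top}(X - XS) + 2\alpha S$ vanishes at the returned $S$.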
$\min_{S} \|\phi(X) - \phi(X)S\|_F^2 + \alpha \mathcal{R}(S).$ (6)

By further extending Eq. (6) from a single kernel mapping to multiple kernel mappings, multiple kernel subspace clustering (MKSC) can be achieved as

$\min_{S} \sum_{k=1}^{r} \|\phi^{(k)}(X) - \phi^{(k)}(X)S\|_F^2 + \alpha \mathcal{R}(S).$ (7)

t-SVD-based tensor nuclear norm
Tensor computation has been widely used in machine learning, signal processing, data mining, computer vision, remote sensing, and biomedical engineering [30]. Owing to the validity of the nuclear norm and tensor computation, [31] extends Eq. (5) to a tensor version, i.e.

$\min_{\hat{\mathcal{S}}} \mathcal{L}(\mathcal{X}, \mathcal{X}\hat{\mathcal{S}}) + \alpha \|\hat{\mathcal{S}}\|_{*}$

where $\|\hat{\mathcal{S}}\|_{*} = \sum_{k=1}^{r} \xi^{(k)} \|\hat{S}_{(k)}\|_{*}$, the weights $\{\xi^{(k)}\}_{k=1}^{r} > 0$ are all equal and satisfy $\sum_{k=1}^{r} \xi^{(k)} = 1$ in [31]; $\hat{\mathcal{S}}$ and $\hat{S}_{(k)}$ are the merged 3-order tensor and its unfolding matrices, respectively; and $\|\hat{\mathcal{S}}\|_{*}$ is the generalized tensor nuclear norm (g-TNN) of $\hat{\mathcal{S}}$. Note that g-TNN lacks a clear physical meaning for general tensors, and penalizing the ranks of $\hat{\mathcal{S}}$ with identical weights is illogical [13]. Different from g-TNN, t-TNN not only possesses a clear physical meaning, but also exploits the high-order relationships among different affinity graphs [5,32]. Besides, t-TNN captures the structural information of a tensor better than the unfolding-based TNN [33]. Therefore, t-TNN is employed in this paper to exploit the structural information of the tensor.
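The t-TNN can be computed directly from its definition: take the FFT along the third mode and sum the singular values of every frontal slice (some papers scale this sum by $1/n_3$; the unscaled convention is used here):

```python
import numpy as np

def t_tnn(B):
    """t-SVD based tensor nuclear norm: sum of the singular values of all
    frontal slices of B_f = fft(B, axis=2)."""
    Bf = np.fft.fft(B, axis=2)
    return float(sum(np.linalg.svd(Bf[:, :, i], compute_uv=False).sum()
                     for i in range(B.shape[2])))
```

For a tensor with a single frontal slice this reduces to the ordinary matrix nuclear norm, since the FFT along a length-1 axis is the identity.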

Auto-weighted multiple kernel tensor clustering
By leveraging self-expressiveness subspace learning with the $r$ mapping functions $\{\phi^{(k)}(\cdot)\}_{k=1}^{r}$, the kernelized self-expressiveness is formulated as

$\min_{\{S^{(k)}\}_{k=1}^{r}} \sum_{k=1}^{r} \mathrm{Tr}\big(H^{(k)} - 2H^{(k)}S^{(k)} + (S^{(k)})^{\top} H^{(k)} S^{(k)}\big)$ (8)

where $\mathrm{Tr}(\cdot)$ is the trace of a square matrix and $H^{(k)}$ is the $k$-th base kernel matrix obtained via the kernel trick, defined as $\mathrm{ker}(x_i, x_j) = \phi(x_i)^{\top}\phi(x_j)$. To avoid the trivial solution (i.e., $S^{(k)} = I$, where $I$ is the identity matrix), we then have

$\min_{\{S^{(k)}\}_{k=1}^{r}} \sum_{k=1}^{r} \mathrm{Tr}\big(H^{(k)} - 2H^{(k)}S^{(k)} + (S^{(k)})^{\top} H^{(k)} S^{(k)}\big) + \alpha \sum_{k=1}^{r} \|S^{(k)}\|_F^2$ (9)

where $\|\cdot\|_F$ is the Frobenius norm, and $r$ candidate affinity graphs $\{S^{(k)}\}_{k=1}^{r}$ can be learned. To fully capture both the consistent and complementary information among these $r$ graphs, they are stacked into a tensor $\mathcal{S}^{*} = \mathrm{bvfold}([S^{(1)}; \cdots; S^{(r)}]) \in \mathbb{R}^{n \times n \times r}$, which is then rotated from $\mathbb{R}^{n \times n \times r}$ to $\mathcal{S} \in \mathbb{R}^{n \times r \times n}$, i.e. $\mathcal{S} = \mathrm{rotate}(\mathcal{S}^{*})$, as illustrated in Fig. 2. Compared with using $\mathcal{S}^{*}$, the computational complexity is thereby largely reduced (as shown in "Computational complexity and convergence").
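Because the objective in Eq. (9) decouples over $k$, each candidate graph has the closed-form solution $S^{(k)} = (H^{(k)} + \alpha I)^{-1} H^{(k)}$. A sketch of the candidate-graph construction and the stack-and-rotate step (realizing the rotation as an axis permutation is one natural reading of Fig. 2):

```python
import numpy as np

def candidate_graphs(kernels, alpha):
    """Per-kernel closed form of Eq. (9): S_k = (H_k + alpha*I)^(-1) H_k."""
    n = kernels[0].shape[0]
    S_list = [np.linalg.solve(H + alpha * np.eye(n), H) for H in kernels]
    S_star = np.stack(S_list, axis=2)        # n x n x r graph tensor
    S_rot = np.transpose(S_star, (0, 2, 1))  # rotated to n x r x n
    return S_list, S_rot
```

Stationarity of the closed form can be confirmed by checking that the gradient $-2H^{(k)} + 2H^{(k)}S^{(k)} + 2\alpha S^{(k)}$ vanishes.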
Since the $r$ candidate affinity graphs originate from one source, the different $S^{(k)}$ should share some consensus structure information. Additionally, the graph tensor $\mathcal{S}$ should enjoy the low-rank property, because the number of samples is usually much larger than the number of clusters. Thus, we impose a low-rank tensor constraint on the rotated tensor $\mathcal{S}$. Different from the traditional g-TNN, t-TNN possesses appealing properties for exploring the complementary information of candidate graphs [13]. By imposing the t-TNN constraint, Eq. (9) can be updated to

$\min_{\{S^{(k)}\}_{k=1}^{r}} \sum_{k=1}^{r} \mathrm{Tr}\big(H^{(k)} - 2H^{(k)}S^{(k)} + (S^{(k)})^{\top} H^{(k)} S^{(k)}\big) + \alpha \sum_{k=1}^{r} \|S^{(k)}\|_F^2 + \beta \|\mathcal{S}\|_{\circledast}, \quad \mathrm{s.t.}\ \mathcal{S} = \mathrm{rotate}(\mathrm{bvfold}([S^{(1)}; \cdots; S^{(r)}]))$ (10)

where $\alpha$ is a balance parameter, $\mathrm{bvec}(\mathcal{S}) = [S^{(1)}; S^{(2)}; \cdots; S^{(n_3)}]$ is the block vectorizing function of $\mathcal{S}$, and $\beta > 0$ controls the contribution of the t-TNN. In this way, Eq. (10) can simultaneously explore consistent and complementary information in the kernelized low-rank tensor space.

Fig. 2 The rotation process of AMKTC for the graph tensor
Once $\mathcal{S}$ is learned by Eq. (10), the averaged graph $\bar{S} = \sum_{k=1}^{r} S^{(k)}/r$ can be obtained, on which spectral clustering is then performed. Nevertheless, assigning equal weights to different graphs ignores their relative importance, so unreliable graphs may significantly decrease the clustering performance. To solve this problem, a well-designed self-weighted strategy [34] can be considered to assign appropriate weights to different graphs; its model is formulated as

$\min_{Z, w} \sum_{k=1}^{r} w^{(k)} \|Z - S^{(k)}\|_F^2 + \lambda \Psi(w), \quad \mathrm{s.t.}\ z_{ij} \geq 0,\ z_i^{\top}\mathbf{1} = 1$ (11)

where $\lambda > 0$ is a scalar controlling the weight distribution, $\Psi(w)$ is a regularizer on the weights, and $z_{ij} \geq 0$, $z_i^{\top}\mathbf{1} = 1$ guarantee the probability property. Inspired by a similar form applied in [35], Eq. (11) can be converted to the following form:

$\min_{Z, w} \sum_{k=1}^{r} w^{(k)} \|Z - S^{(k)}\|_F^2 + \lambda \|w\|_2^2, \quad \mathrm{s.t.}\ z_{ij} \geq 0,\ z_i^{\top}\mathbf{1} = 1,\ w^{(k)} \geq 0,\ \textstyle\sum_{k=1}^{r} w^{(k)} = 1$ (12)

where $w = [w^{(1)}, \ldots, w^{(r)}]$, and $\lambda > 0$ is a regularization parameter controlling the contributions of the different $S^{(k)}$ to $Z$.
On the one hand, if $\lambda \to 0$, Eq. (12) yields a trivial solution: only the $S^{(k)}$ with the smallest distance to $Z$ receives weight $w^{(k)} = 1$, and all other weights are zero. For $\lambda \to \infty$, all elements of $w$ become equal to $1/r$. Although effective, the linear combination of candidate graphs largely limits the representation ability of the consensus graph. Besides, noise and outliers may result in inappropriate weight assignments. On the other hand, $\lambda$ in Eq. (12) must be tuned over a large range, i.e. $0 \to \infty$, and the ideal $\lambda$ with the optimal clustering performance varies across datasets.
To address the aforementioned problems, we assume that each candidate affinity graph can be deemed a perturbation of the consensus graph. Candidate graphs that are more important to the consensus graph should receive larger weights, and suboptimal graphs smaller weights. Motivated by this assumption, we develop an auto-weighted strategy that removes $\lambda$ without degrading clustering performance via the following optimization problem:

$\min_{Z} \sum_{k=1}^{r} w^{(k)} \|Z - S^{(k)}\|_F^2, \quad \mathrm{s.t.}\ z_{ij} \geq 0,\ z_i^{\top}\mathbf{1} = 1$ (13)

where the explicit constraints on $w^{(k)}$ are removed so as to give the weights more freedom. The $w^{(k)}$ of Eq. (13) can be solved via Theorem 1.

Theorem 1 The weight corresponding to the $k$-th candidate affinity graph is computed by

$w^{(k)} = \frac{1}{2\|Z - S^{(k)}\|_F}.$ (14)
Proof Motivated by the iteratively re-weighted method in [36], an auxiliary problem without $w$ is introduced as

$\min_{Z} \sum_{k=1}^{r} \|Z - S^{(k)}\|_F, \quad \mathrm{s.t.}\ z_{ij} \geq 0,\ z_i^{\top}\mathbf{1} = 1$ (15)

which leads to the Lagrange function

$\sum_{k=1}^{r} \|Z - S^{(k)}\|_F + \Omega(\Lambda, Z)$

where $\Lambda$ and $\Omega(\Lambda, Z)$ denote the Lagrange multiplier and the constraint indicator function of $Z$, respectively. Taking the derivative of the Lagrange function w.r.t. $Z$, Eq. (15) becomes

$\sum_{k=1}^{r} w^{(k)} \frac{\partial \|Z - S^{(k)}\|_F^2}{\partial Z} + \frac{\partial \Omega(\Lambda, Z)}{\partial Z} = 0$ (16)

where $w^{(k)} = \frac{1}{2\|Z - S^{(k)}\|_F}$. It is obvious that Eq. (16) is also the derivative of the Lagrange function of Eq. (13). Note that $w^{(k)}$ depends on $Z$ and thus cannot be solved directly. However, if $Z$ is kept fixed, $w^{(k)}$ can be deemed a solution of Eq. (16). In practice, to avoid division by zero, $w^{(k)}$ is computed as

$w^{(k)} = \frac{1}{2\sqrt{\|Z - S^{(k)}\|_F^2 + \zeta}}$ (17)

where $\zeta$ is infinitely close to zero.
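Theorem 1 together with its numerically safe variant Eq. (17) amounts to a one-line weight rule; a small sketch:

```python
import numpy as np

def auto_weights(S_list, Z, zeta=1e-12):
    """Eq. (17): w_k = 1 / (2 * sqrt(||Z - S_k||_F^2 + zeta))."""
    return np.array([0.5 / np.sqrt(np.linalg.norm(Z - S, 'fro') ** 2 + zeta)
                     for S in S_list])
```

As intended, a candidate graph closer to the current consensus $Z$ receives a larger weight.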
Learning $S^{(k)}$ and $Z$ separately leads to suboptimal solutions. In addition, the tensor low-rank constraint can suppress noise and further improve the reliability of the weights. To this end, we seamlessly integrate Eqs. (10) and (13) into a unified framework. Overall, the ultimate objective of AMKTC is formulated as

$\min_{\{S^{(k)}\}, Z} \sum_{k=1}^{r} \mathrm{Tr}\big(H^{(k)} - 2H^{(k)}S^{(k)} + (S^{(k)})^{\top} H^{(k)} S^{(k)}\big) + \alpha \sum_{k=1}^{r} \|S^{(k)}\|_F^2 + \beta \|\mathcal{S}\|_{\circledast} + \gamma \sum_{k=1}^{r} w^{(k)} \|Z - S^{(k)}\|_F^2,$
$\mathrm{s.t.}\ \mathcal{S} = \mathrm{rotate}(\mathrm{bvfold}([S^{(1)}; \cdots; S^{(r)}])),\ z_{ij} \geq 0,\ z_i^{\top}\mathbf{1} = 1$ (18)

The learned consensus graph of Eq. (18) has the following superiorities: (1) $Z$ can capture non-linear relationships well; (2) by imposing the t-TNN constraint on $\mathcal{S}$, the high-order information underlying the multiple base kernels can be captured, so the more accurate candidate affinity graphs $\{S^{(k)}\}_{k=1}^{r}$ lead to more effective and reasonable weights. Owing to these merits, a high-quality $Z$ can be obtained for improving clustering performance.

Optimization
Based on the alternating direction method of multipliers (ADMM) [37], an auxiliary variable $\mathcal{A}$ is first introduced to make Eq. (18) separable:

$\min_{\{S^{(k)}\}, Z, \mathcal{A}} \sum_{k=1}^{r} \mathrm{Tr}\big(H^{(k)} - 2H^{(k)}S^{(k)} + (S^{(k)})^{\top} H^{(k)} S^{(k)}\big) + \alpha \sum_{k=1}^{r} \|S^{(k)}\|_F^2 + \beta \|\mathcal{A}\|_{\circledast} + \gamma \sum_{k=1}^{r} w^{(k)} \|Z - S^{(k)}\|_F^2, \quad \mathrm{s.t.}\ \mathcal{S} = \mathcal{A},\ z_{ij} \geq 0,\ z_i^{\top}\mathbf{1} = 1$ (19)

whose augmented Lagrangian function is

$\mathcal{L}_{\mu} = \sum_{k=1}^{r} \mathrm{Tr}\big(H^{(k)} - 2H^{(k)}S^{(k)} + (S^{(k)})^{\top} H^{(k)} S^{(k)}\big) + \alpha \sum_{k=1}^{r} \|S^{(k)}\|_F^2 + \beta \|\mathcal{A}\|_{\circledast} + \gamma \sum_{k=1}^{r} w^{(k)} \|Z - S^{(k)}\|_F^2 + \langle \mathcal{Y}, \mathcal{A} - \mathcal{S} \rangle + \frac{\mu}{2} \|\mathcal{S} - \mathcal{A}\|_F^2$ (20)

where $\mu$ and $\mathcal{Y}$ are the penalty parameter and the Lagrangian multiplier, respectively. Equation (20) can be solved by updating each variable in turn with the remaining variables fixed.
Step 1. S-subproblem: Fixing $\mathcal{A}$, $w$ and $Z$, we update $\{S^{(k)}\}_{k=1}^{r}$ via

$\min_{S^{(k)}} \mathrm{Tr}\big(-2H^{(k)}S^{(k)} + (S^{(k)})^{\top} H^{(k)} S^{(k)}\big) + \alpha \|S^{(k)}\|_F^2 + \gamma w^{(k)} \|Z - S^{(k)}\|_F^2 + \langle Y^{(k)}, A^{(k)} - S^{(k)} \rangle + \frac{\mu}{2} \|S^{(k)} - A^{(k)}\|_F^2$ (21)

whose closed-form solution is obtained by setting the derivative w.r.t. $S^{(k)}$ to zero:

$S^{(k)} = \big(2H^{(k)} + (2\alpha + 2\gamma w^{(k)} + \mu) I\big)^{-1} \big(2H^{(k)} + 2\gamma w^{(k)} Z + \mu A^{(k)} + Y^{(k)}\big)$ (22)

where $A^{(k)}$ and $Y^{(k)}$ denote the slices of $\mathcal{A}$ and $\mathcal{Y}$ corresponding to $S^{(k)}$.

Step 2. A-subproblem: Fixing $\{S^{(k)}\}_{k=1}^{r}$ and removing the irrelevant terms, the optimization problem of $\mathcal{A}$ is

$\min_{\mathcal{A}} \beta \|\mathcal{A}\|_{\circledast} + \frac{\mu}{2} \Big\|\mathcal{A} - \Big(\mathcal{S} - \frac{\mathcal{Y}}{\mu}\Big)\Big\|_F^2$ (23)

Let $\mathcal{M} = \mathcal{S} - \frac{\mathcal{Y}}{\mu}$; this t-TNN minimization problem can be solved by the tensor tubal-shrinkage operator of Theorem 2 [13].
Theorem 2 For 3-order tensors $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, $\mathcal{M} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ and a given scalar $\rho > 0$, the low-rank tensor optimization problem

$\min_{\mathcal{A}} \rho \|\mathcal{A}\|_{\circledast} + \frac{1}{2} \|\mathcal{A} - \mathcal{M}\|_F^2$ (24)

has the global optimal solution given by the following tensor tubal-shrinkage operator:

$\mathcal{A}^{*} = \Gamma_{n_3 \rho}(\mathcal{M}) = \mathcal{U} * \mathcal{C}_{n_3 \rho}(\mathcal{S}) * \mathcal{V}^{\top}$ (25)

where $\mathcal{M} = \mathcal{U} * \mathcal{S} * \mathcal{V}^{\top}$ is the t-SVD of $\mathcal{M}$, $\mathcal{U}$ and $\mathcal{V}$ are orthogonal tensors of size $n_1 \times n_1 \times n_3$ and $n_2 \times n_2 \times n_3$, and $\mathcal{C}_{n_3 \rho}(\mathcal{S}) = \mathcal{S} * \mathcal{J}$, with $\mathcal{J}$ an f-diagonal tensor whose diagonal elements in the Fourier domain are $\big(1 - \frac{n_3 \rho}{\mathcal{S}_f^{(i)}(j, j)}\big)_{+}$.

Step 3. Z-subproblem: Eq. (18) reduces to the following subproblem:

$\min_{Z} \sum_{k=1}^{r} w^{(k)} \|Z - S^{(k)}\|_F^2, \quad \mathrm{s.t.}\ z_{ij} \geq 0,\ z_i^{\top}\mathbf{1} = 1$ (26)

which can be rewritten row-wise in vector form as

$\min_{z_i} \sum_{k=1}^{r} w^{(k)} \|z_i - s_i^{(k)}\|_2^2, \quad \mathrm{s.t.}\ z_{ij} \geq 0,\ z_i^{\top}\mathbf{1} = 1$ (27)

which can be effectively solved as in [3].
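Theorem 2 can be implemented by soft-thresholding the singular values of each Fourier-domain frontal slice with threshold $n_3\rho$ (the multiplicative form $(1 - n_3\rho/\sigma)_{+}\,\sigma$ equals $\max(\sigma - n_3\rho, 0)$); a sketch:

```python
import numpy as np

def tubal_shrinkage(M, rho):
    """Solve min_A rho*||A||_tnn + 0.5*||A - M||_F^2 (Theorem 2)."""
    n1, n2, n3 = M.shape
    Mf = np.fft.fft(M, axis=2)
    Af = np.zeros_like(Mf)
    for i in range(n3):
        u, s, vh = np.linalg.svd(Mf[:, :, i], full_matrices=False)
        # soft-threshold the singular values of each Fourier-domain slice
        Af[:, :, i] = (u * np.maximum(s - n3 * rho, 0.0)) @ vh
    return np.real(np.fft.ifft(Af, axis=2))
```

For $n_3 = 1$ this reduces to the ordinary singular value thresholding of a matrix, which provides a quick sanity check.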
Step 4. w-subproblem: Ignoring the irrelevant terms and fixing the other variables, we update $w$ via Theorem 1.
Step 5. Multiplier update: The ADMM multipliers are updated as

$\mathcal{Y} = \mathcal{Y} + \mu (\mathcal{A} - \mathcal{S}), \quad \mu = \min(\tau_1 \mu, \mu_{\max})$ (28)

where $\tau_1$ and $\mu_{\max}$ are ADMM scalars. The stopping criterion is the residual condition $\max\{|obj^{t+1} - obj^{t}|, \|\mathcal{S}^{t+1} - \mathcal{S}^{t}\|_F^2\} \leq \epsilon$, where $obj$, $t$ and $\epsilon$ are the objective value of Eq. (18), the iteration number, and a pre-defined tolerance, respectively. Finally, the optimization procedure of AMKTC is summarized in Algorithm 1.
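Putting Steps 1–5 together, a compact illustrative ADMM loop follows. Two simplifications are made for brevity: the rotation of the graph tensor is omitted (the tubal shrinkage acts on the unrotated n×n×r tensor), and the row-wise QP of Step 3 is replaced by a weighted average followed by a nonnegative row normalization as a stand-in for the exact solver of [3]:

```python
import numpy as np

def shrink_slice(Mf_i, thr):
    u, s, vh = np.linalg.svd(Mf_i, full_matrices=False)
    return (u * np.maximum(s - thr, 0.0)) @ vh

def amktc_sketch(kernels, alpha=1.0, beta=0.1, gamma=1e-4,
                 mu=1e-3, tau1=2.0, mu_max=1e10, n_iter=30, tol=1e-7):
    r, n = len(kernels), kernels[0].shape[0]
    S = np.zeros((n, n, r)); A = np.zeros((n, n, r)); Y = np.zeros((n, n, r))
    Z = np.zeros((n, n)); w = np.ones(r) / r
    for _ in range(n_iter):
        # Step 1: closed-form S^(k) update, Eq. (22)
        for k in range(r):
            H = kernels[k]
            lhs = 2*H + (2*alpha + 2*gamma*w[k] + mu) * np.eye(n)
            rhs = 2*H + 2*gamma*w[k]*Z + mu*A[:, :, k] + Y[:, :, k]
            S[:, :, k] = np.linalg.solve(lhs, rhs)
        # Step 2: tubal shrinkage of M = S - Y/mu, threshold n3*beta/mu
        Mf = np.fft.fft(S - Y / mu, axis=2)
        Af = np.stack([shrink_slice(Mf[:, :, i], r * beta / mu)
                       for i in range(r)], axis=2)
        A = np.real(np.fft.ifft(Af, axis=2))
        # Step 3: consensus graph -- weighted average projected to
        # nonnegative rows summing to one (stand-in for the QP of [3])
        Z = np.maximum(np.einsum('k,ijk->ij', w, S) / w.sum(), 0.0)
        rows = Z.sum(axis=1, keepdims=True)
        Z = np.where(rows > 1e-12, Z / np.maximum(rows, 1e-12), 1.0 / n)
        # Step 4: auto-weights, Theorem 1 / Eq. (17)
        w = np.array([0.5 / np.sqrt(np.linalg.norm(Z - S[:, :, k])**2 + 1e-12)
                      for k in range(r)])
        # Step 5: multiplier and penalty updates
        Y = Y + mu * (A - S)
        mu = min(tau1 * mu, mu_max)
        if np.linalg.norm(S - A) < tol:
            break
    return Z, w
```

The returned Z is a row-stochastic consensus affinity graph for spectral clustering; the hyperparameter defaults here are illustrative, not the tuned values used in the experiments.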

Computational complexity and convergence
Algorithm 1 involves five main subproblems: the S-subproblem, the A-subproblem, the Z-subproblem, the w update, and the multiplier update. Their computational complexities are as follows. The w update costs O(rn²). The S-subproblem costs O(rn³) due to its matrix inversions. The A-subproblem first requires O(rn² log(n)) for the FFT and inverse FFT of the tensor $\mathcal{S} \in \mathbb{R}^{n \times r \times n}$, and then O(r²n²) for the SVDs of the Fourier-domain slices, i.e. O(rn² log(n) + r²n²) in total under the rotation operator. Note that if the rotation operator of Fig. 2 is not applied, this cost increases to O(rn² log(r) + rn³), where r ≪ n. The remaining updates are lightweight. Theoretically, the computational cost of Algorithm 1 is approximately O(tn³), where t denotes the total number of iterations. Since t ≪ n and r is small in practice, the overall cost of AMKTC can be deemed O(n³). Although the computational complexity of AMKTC is similar to that of the competitors, their clustering performance is largely inferior to AMKTC's.
Although it is difficult to prove the convergence of Algorithm 1 in general, it converges to a local optimum with high probability since all of its subproblems have closed-form solutions. In addition, the empirical evidence in "Convergence" demonstrates that Algorithm 1 has good convergence behavior.

Experiments
In this section, the effectiveness of our AMKTC is verified by conducting experiments on seven widely used datasets.

Benchmark datasets and kernel setting
In the experiments, seven widely used datasets, including Yale, Jaffe, AR, ORL, binary alphadigits (BA), COIL-20, and Deep CIFAR-10 (DC), are employed to evaluate the clustering performance of AMKTC, where DC is a deep-feature dataset constructed as in [3]. The details of these datasets are summarized in Table 1. Following the settings in [7,8,10,23], we construct 12 base kernels (i.e., r = 12 kernel functions) to form the kernel pool listed in Table 2, including seven radial basis function (RBF) kernels with $\mathrm{ker}(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / (2\tau^2\sigma^2))$, where $\tau^2$ varies in {0.01, 0.05, 0.1, 1, 10, 50, 100} and $\sigma$ is the maximum distance between samples; four polynomial kernels with $\mathrm{ker}(x_i, x_j) = (a + x_i^{\top} x_j)^b$, where $a \in \{0, 1\}$ and $b \in \{2, 4\}$; and a cosine kernel with $\mathrm{ker}(x_i, x_j) = x_i^{\top} x_j / (\|x_i\| \cdot \|x_j\|)$. Finally, all kernel matrices $\{H^{(k)}\}_{k=1}^{r}$ are normalized to the [0, 1] range.
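The 12-kernel pool above can be constructed as follows; min–max scaling is used here as one plausible reading of "normalized to [0, 1]":

```python
import numpy as np

def build_kernel_pool(X):
    """12 base kernels for row-wise samples X (n x d): 7 RBF, 4 polynomial, 1 cosine."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    sigma = np.sqrt(d2.max())  # maximum pairwise distance between samples
    G = X @ X.T
    kernels = [np.exp(-d2 / (2.0 * t2 * sigma ** 2))
               for t2 in (0.01, 0.05, 0.1, 1, 10, 50, 100)]
    kernels += [(a + G) ** b for a in (0, 1) for b in (2, 4)]
    norms = np.sqrt(np.diag(G))
    kernels.append(G / np.outer(norms, norms))
    # min-max normalization of each kernel matrix to [0, 1]
    return [(K - K.min()) / (K.max() - K.min()) for K in kernels]
```

Each entry of the pool stays symmetric, since every kernel formula and the min–max rescaling are symmetric in $x_i$ and $x_j$.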

Evaluation metrics
Three widely used evaluation metrics, namely clustering accuracy (ACC), normalized mutual information (NMI), and purity, are used to evaluate the clustering performance of the proposed AMKTC. Higher values indicate better clustering performance.
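ACC requires the best one-to-one match between predicted and true labels (Hungarian algorithm), and purity assigns each predicted cluster its majority true class; NMI is available as `sklearn.metrics.normalized_mutual_info_score`. A sketch of ACC and purity:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    """ACC: accuracy under the best label permutation (Hungarian matching)."""
    lt, lp = np.unique(y_true), np.unique(y_pred)
    C = np.array([[np.sum((y_pred == p) & (y_true == t)) for t in lt] for p in lp])
    row, col = linear_sum_assignment(-C)  # maximize matched counts
    return C[row, col].sum() / len(y_true)

def purity(y_true, y_pred):
    """Purity: fraction of samples in the majority true class of their cluster."""
    total = sum(np.bincount(y_true[y_pred == p]).max() for p in np.unique(y_pred))
    return total / len(y_true)
```

For example, predicted labels that are a pure relabeling of the ground truth score 1.0 on both metrics.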
For fairness, we tune the parameters of the competitors following the values recommended by their respective authors. Further, we run each competitor 20 times and report the average results. The k-means based methods are described as follows: • MKKM [24] is a multiple kernel extension of fuzzy k-means. • RMKKM [24] is the robust extension of MKKM.
• MVCLFA [25] divides pre-defined kernel matrices into multiple low-dimensional partitions, which are used to learn a consensus partition for MKC.
Please see "Introduction" for the details of the remaining MKSC comparison methods.

Experimental results and analysis
Tables 3, 4 and 5 report the average clustering results of the competitors; the standard deviations are not shown since they are all less than 1%.
• As shown in Tables 3

Parameter sensitivity
Three parameters α, β and γ of AMKTC need to be set properly. For simplicity, γ is fixed as γ = 0.0001, which experimentally yields the best clustering performance on all evaluation datasets. Specifically, the parameters α and β balance the effects of $\sum_k \|S^{(k)}\|_F^2$ and $\|\mathcal{S}\|_{\circledast}$, respectively. Using a grid search with step size 10, we tune both α and β from $10^{-5}$ to $10^{2}$. Taking the AR, ORL, COIL-20, BA and Deep CIFAR-10 datasets as examples, as shown in Figs. 3 and 4, the best performance of AMKTC on the COIL-20 dataset is achieved with $\alpha \in [10^{-2}, 10^{-1}]$ and $\beta \in [10^{-5}, 10^{-4}]$, while the best performance on the remaining evaluation datasets is obtained in different ranges (see Figs. 3 and 4).

Table 3 The ACC of some compared MKC methods on seven datasets (the first and second best results are highlighted in red and blue)
Table 4 The NMI of some compared MKC methods on seven datasets (the first and second best results are highlighted in red and blue)
Table 5 The purity of some compared MKC methods on seven datasets (the first and second best results are highlighted in red and blue)

Visualization of clustering result
The original data and the clustering results of two methods (our AMKTC and the second-best competitor CAGL) are visualized on the ORL and DC datasets in Fig. 5. First, the clustering results of AMKTC and CAGL clearly achieve a better structure distribution than the original data on both datasets. In addition, although the differently colored points of CAGL in Fig. 5b and e show a scattered structure distribution, CAGL still produces some incorrect cluster partitions compared with our method. In contrast, AMKTC (Fig. 5c and f) has fewer wrong partitions than CAGL; in particular, Fig. 5f contains only a handful of wrong partitions. This phenomenon demonstrates that our method generalizes its partitions well across different datasets.

Convergence
In this section, we evaluate the convergence of the proposed AMKTC method on the benchmark datasets (e.g., AR, BA, CIFAR-10, COIL-20, ORL and Yale), with the results shown in Fig. 6. Obviously, the iterative algorithm converges rapidly to a stable point, and on some datasets it meets the stopping criterion within just eight epochs.

Conclusion
In this paper, a novel multiple kernel subspace clustering method, namely AMKTC, is proposed to address the challenging problem of capturing the high-order correlations hidden in different base kernels. Meanwhile, a non-linear auto-weighted graph fusion scheme is used to learn a consensus affinity graph with optimal weights. In AMKTC, we integrate candidate affinity graph learning, graph tensor learning and auto-weighted consensus graph learning into a unified objective function, such that a consensus affinity graph with optimal weights is learned for clustering. Compared with state-of-the-art MKC methods, the superior performance of our method is demonstrated via extensive experimental results. In the future, we will study tensor train and tensor ring decompositions to further enhance the representation capability of the graph tensor.

Table 1
Summaries of the seven used datasets

Table 2
The choice of kernel functions [3]

For MKSC methods, the widely used self-expressiveness subspace learning (SESL) is adopted to learn an affinity graph, and spectral clustering is then employed to obtain the final clustering results, so as to effectively capture the subspace structure. Moreover, it can be observed that MKSC methods perform better than k-means based MKC methods in most cases. Among these MKSC methods, only CAGL and the proposed AMKTC are pure graph-learning methods, while the others adhere to LKF or NKF schemes. This conflicts with the view that a high-quality affinity graph is the real objective of MKSC methods [3]. From the results, it is obvious that CAGL and AMKTC are superior to the other competitors.