GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning

Graph learning is increasingly applied to image clustering to reveal intra-class and inter-class relationships in data. However, existing graph learning-based image clustering focuses on grouping images under a single view, which under-utilises the information provided by the data. To address this, we propose a self-supervised multi-view image clustering technique based on contrastive heterogeneous graph learning. Our method computes a heterogeneous affinity graph for multi-view image data. It conducts Local Feature Propagation (LFP) to reason over the local neighbourhood of each node, and performs Influence-aware Feature Propagation (IFP) from each node to its influential node to learn the clustering intention. The proposed framework employs two contrastive objectives: the first contrasts and fuses multiple views for the overall LFP embedding, and the second maximises the mutual information between the LFP and IFP representations. We conduct extensive experiments on the benchmark datasets for the problem, i.e. COIL-20, Caltech7 and CASIA-WebFace. Our evaluation shows that our method outperforms state-of-the-art methods, including the popular techniques MVGL, MCGC and HeCo.


Introduction
In computer vision, clustering is traditionally understood as a single-view problem, where an algorithm focuses on the holistic features of individual data samples to group them. However, these samples can result from multiple interpretations or representations of the original data. For instance, we can generate different sets of samples as Gabor [14], CLD [19] and HOG [6] descriptors of the images. These representations may hold complementary properties that can be leveraged for improved clustering. This fact has recently piqued the interest of the computer vision community, resulting in the emerging topic of multi-view clustering (MVC) [2,10,23,26,29,30,46,48,58].
Another contemporary line of research for image clustering favors graph-based methods [16,37,42,43,45]. The key advantage of graphs is their intrinsic ability to encode data structure information, which is beneficial for the clustering problem. For instance, methods like [20,21,28,37,42,44,52] leverage trained Graph Convolutional Networks (GCNs) for images to reason about the linkage likelihoods between a given node and its neighbours for graph completion, thereby achieving more accurate clusters.
In general, graph-based methods are known to benefit from Contrastive Learning (CL) [4], which induces models using self-supervision. CL trains a model by maximizing the agreement of its predictions on samples that are transformations of the same original sample. For graphs, the analogous Contrastive Graph Learning (CGL) paradigm aims to maximize the prediction agreement on different views of the same underlying graph [10,17,30,48,58]. These views are created by applying random operations, e.g., adding/deleting nodes/edges and dropping features, to an original graph. In line with the negative sample creation in CL, CGL considers other original graphs as the negative samples. It learns node-level (intra-view) or graph-level (inter-view) representations - illustrated in Fig. 1(a) - with a graph neural network and a contrastive loss function.
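To make the view-generation step concrete, the random augmentations named above (edge dropping and feature masking) can be sketched as follows. This is a minimal NumPy illustration, not code from any of the cited methods; the drop and mask rates are arbitrary choices.

```python
import numpy as np

def augment_graph(adj, feat, edge_drop=0.2, feat_mask=0.2, rng=None):
    """Create one stochastic view of a graph for contrastive learning by
    randomly dropping edges and masking feature dimensions."""
    rng = rng or np.random.default_rng(0)
    adj = adj.copy()
    # Drop each undirected edge with probability `edge_drop`.
    iu = np.triu_indices_from(adj, k=1)
    keep = rng.random(len(iu[0])) >= edge_drop
    adj[iu] *= keep
    adj.T[iu] *= keep          # mirror the drop to keep the matrix symmetric
    # Zero out a random subset of feature columns for all nodes.
    mask = rng.random(feat.shape[1]) >= feat_mask
    return adj, feat * mask
```

Two calls with different random states yield the two correlated views that the contrastive loss then pulls together.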
The self-supervised CGL paradigm naturally suits the multi-view perspective. For instance, Hassani and Khasahmadi [11] and Wang et al. [36] created different graph views and then utilized node-level and graph-level representations for multi-view contrastive learning. These methods consider structural semantics as global information for learning the node-level embeddings, neglecting the fact that each node can also have various features providing more information. Coming back to our main problem of multi-view image clustering, existing methods generally first compute a data affinity matrix for raw features or learned representations under multiple views, and then perform clustering using the affinity matrix [5,13,18,27,38,40,49,50]. These methods concatenate multiple views to construct a denoised homogeneous graph for image clustering. We provide a simple illustration of a multi-view homogeneous graph for image data in Fig. 1(b), where views are defined using compositional properties. The graph denoising, however, can lead to the loss of important semantic information. Additionally, the fusion of multiple views into a homogeneous graph can render the heterogeneous properties of multi-view data meaningless. This suggests the potential to leverage more complementary information for multi-view image clustering by treating images under heterogeneous graphs - Fig. 1(c).
Considering the above narrative, in this work, we propose an inductive Multi-view Image Clustering framework with self-supervised contrastive heterogeneous Graph co-learning (GoMIC). In GoMIC, we maintain the relationships between different views as a heterogeneous affinity graph, while preserving the uniqueness and independence of each view. Our heterogeneous graph consists of several homogeneous affinity graphs - Fig. 1(d). By constructing the heterogeneous affinity graph, the local neighborhood information from each view is easily available to each node. We devise two encoding strategies to learn node feature propagation. In the first, we propagate features from a node to its neighbourhood in its own and other views through several hops - Fig. 1(e). The second strategy is influence-aware propagation, which learns how each node's features propagate towards the densest nodes - Fig. 1(f). Both encoding strategies employ a proposed attention mechanism to control the feature influence of different nodes. Furthermore, to better exploit the learned embeddings under CGL, we explicitly contrast each pair of nodes for mutual information maximisation. We also customise the contrastive loss function to fit our contrastive objective.
Our key contributions are summarised below.
1. To the best of our knowledge, this is a pioneering multi-view image clustering method using heterogeneous information networks that leverages contrastive graph learning.
2. We devise two novel heterogeneous information network encoder strategies and an influence attention mechanism to learn the embedding of each node according to its local feature propagation and influence-aware feature propagation, respectively.
3. We enhance the loss function for contrastive graph learning to consider the mutual first neighbouring nodes as mutual positive samples.
4. We conduct comprehensive experiments on three benchmark datasets, not only demonstrating large improvements over existing self-supervised heterogeneous graph methods, but also achieving better results than popular supervised methods across the datasets.
We discuss related work in Section 2. Section 3 presents our proposed framework, GoMIC. In Section 4, we introduce three open multi-view image datasets and evaluate our proposed framework against six state-of-the-art MVC methods. Finally, Section 5 concludes the paper.

Prior Art & Background
We discuss the related work below. This discussion also includes analytical details that are later utilized in discussing the methodology.

Heterogeneous graph neural network
In recent years, heterogeneous graphs have become increasingly popular in neural network research. For instance, Wang et al. [34] studied the use of hierarchical attention to depict node-level and semantic-level structures in heterogeneous graphs. Similarly, Fu et al. [9] incorporated intermediate nodes of meta-paths in the networks. Yun et al. [47] developed GTN to automatically identify useful graph connections. A technique dubbed MAHINDER is proposed in [55,57] to employ and encode meta-paths over different views, with attention on the importance of attributes and data views. In an unsupervised setting, a heterogeneous graph neural network is proposed in [51], which samples a fixed number of neighbours and fuses their features using LSTMs. Zhao et al. [53] focused on network schema and preserved pairwise and network-schema proximity simultaneously. Hu et al. [12] devised node- and edge-type dependent parameters to characterise heterogeneous attention over graph edges. The above methods rely strongly on supervised data signals to encode graphs, whereas the graph structures among the nodes are neglected. In [35], a heterogeneous network HeCo is proposed, which generates meta-paths and network schemas and exploits contrastive learning to further use data signals in a self-supervised manner. Wang et al. [31] and Cai et al. [3] created item clusters and entity clusters to organise the objects and their nearby entities in the knowledge graph. After that, hierarchically combining the heterogeneous data derived from the clusters with the weights produced by the hierarchical attention layers yields the representations. Nevertheless, encoding graphs and nodes while comprehensively considering node relations and graph structures remains largely unresolved for these methods.

Feature propagation
Considering that we devise a feature propagation scheme in our technique, it is imperative to discuss related research in this direction in more detail. Expanding a node's feature by propagation is commonly conducted under a generalisation of the PageRank equation [24,39], which can be expressed as

X̄ = XW_1 + AX̄W_2, (1)

where X contains the original features, A encodes the adjacency, and W_1 and W_2 are coefficient matrices. However, Eq. (1) is not naturally invertible. Therefore, Dornaika [7] modified it with the degree matrix D as follows:

X̄ = XW_1 + D^{-1}AX̄W_2. (2)

The above is convergent if W_2 is non-negative, along with other conditions. Still, this only allows feature propagation on homogeneous graphs. Zhao et al. [54] attempted to extend feature propagation to heterogeneous graphs. They assumed that there are two types of nodes in a heterogeneous graph. They first obtained a learned feature similarity graph for each type, with a threshold sparsifying the feature similarities. Next, they generated the feature propagation graph for each type. Finally, the overall feature graph is obtained via channel attention [47]. Incidentally, this method has a huge computational footprint when dealing with a heterogeneous graph with a large number of relations. More importantly, it is directed more at heterogeneous graph sparsification than at heterogeneous feature propagation.
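The degree-normalized propagation of Eq. (2) can be illustrated with a small fixed-point iteration. In this sketch the coefficient matrices W_1 and W_2 are replaced by scalars w1 and w2 for brevity (an assumption, not the cited formulation); with w2 < 1 the iteration is a contraction and converges.

```python
import numpy as np

def propagate_features(X, A, w1=0.5, w2=0.5, iters=100):
    """Iterate  X_bar <- w1 * X + w2 * D^{-1} A X_bar  to a fixed point.
    X: (n, d) node features, A: (n, n) adjacency, D: degree matrix.
    Scalars w1, w2 stand in for the coefficient matrices of Eq. (2)."""
    D_inv = 1.0 / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    X_bar = X.copy()
    for _ in range(iters):
        X_bar = w1 * X + w2 * (D_inv * (A @ X_bar))  # row-normalised smoothing
    return X_bar
```

Each iteration mixes a node's original feature with the degree-normalised average of its neighbours' current features, which is the smoothing behaviour the propagated features inherit.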

Contrastive loss function
Contrastive Graph Learning (CGL) is derived from contrastive learning (CL) [4] for graph learning. CGL has been increasingly researched recently [10,11,30,48,58] and has achieved excellent performance on graph or node classification by generating and contrasting positive and negative graph view pairs. Here, we organize our review by mainly focusing on the contrastive loss functions of the related CGL studies, which is helpful in understanding our contribution in Section 3.5. In [10,11,15,30,32,56], the authors adopt the learning objective of CL rather straightforwardly. In doing so, they focused on node-level representations and neglected the graph-level information. In [48,58], for any node v_i, its embedding generated in one view, v′_i, and the embedding in the other view, v″_i, form a positive pair, whereas the embeddings of other nodes are negative samples. The pairwise objective for each positive pair is

ℓ(v′_i, v″_i) = -log [ e^{sim(v′_i, v″_i)/τ} / ( e^{sim(v′_i, v″_i)/τ} + Σ_{k≠i} e^{sim(v′_i, v″_k)/τ} + Σ_{k≠i} e^{sim(v′_i, v′_k)/τ} ) ], (3)

where sim denotes the function computing cosine similarity, τ is the temperature parameter, the second denominator term contrasts the inter-view negative pairs, and the third contrasts the intra-view negative pairs. Connecting the above back to heterogeneous graph neural networks, Wang et al. [35] proposed collaboratively contrastive optimization to expand the scope of defining positive samples and used it for self-supervised learning on heterogeneous graphs. However, in this state-of-the-art contrastive objective for heterogeneous graph learning, there is still a lack of consideration of feature propagation in graphs. Also, the frequent use of thresholds in the existing techniques decreases the feasibility of the proposed models.
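The pairwise objective above can be sketched directly. This is a generic NumPy implementation of the [48,58]-style loss for a single anchor node, with one row per node in the embedding matrices of the two views:

```python
import numpy as np

def info_nce_pair(z1, z2, i, tau=0.5):
    """Contrastive loss for anchor node i: (z1[i], z2[i]) is the positive
    pair; all other nodes, in both views, act as negatives."""
    def sim(a, b):  # cosine similarity
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(sim(z1[i], z2[i]) / tau)
    # Inter-view negatives: node i against other nodes in the second view.
    inter = sum(np.exp(sim(z1[i], z2[k]) / tau) for k in range(len(z1)) if k != i)
    # Intra-view negatives: node i against other nodes in its own view.
    intra = sum(np.exp(sim(z1[i], z1[k]) / tau) for k in range(len(z1)) if k != i)
    return -np.log(pos / (pos + inter + intra))
```

Averaging this quantity over all nodes (and symmetrising over the two views) gives the usual training objective.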

Proposed Approach
In this section, we discuss the proposed multi-view image clustering with self-supervised contrastive heterogeneous graph co-learning (GoMIC), illustrated in Fig. 2. Our method encodes nodes from the local neighbourhood context and the influence-oriented context, which fully captures the contrastive structure of the heterogeneous affinity graph. This ensures that our approach properly involves clustering boundary nodes (i.e., vague nodes) in the computations. During the encoding, we design innovative attention mechanisms, which learn feature propagation embeddings naturally. Moreover, in our method, a novel contrastive graph learning framework enables the embeddings to supervise each other informatively.

Preliminaries
Before discussing our technique in detail, we first formalize the relevant concepts to present our technique for multi-view image clustering with contrastive heterogeneous graph learning.
Definition 2 (Heterogeneous Affinity Graph) A heterogeneous affinity graph G = (V, E, φ, ψ) is constructed from the given multi-view data. V and E are the node and edge sets of the original X, φ is the view-based node type mapping function, and ψ is the view-based edge type mapping function. We let φ_u(V) denote the node embedding of the u-th view, and φ_0(V) = V, i.e., the original node features form the 0-th view. By analogy, ψ_0(E) = E. Also, we let φ_u(v_i) = m_i^u and ψ_u(e_i) = p_i^u, where p_i^u indicates an edge e_i in the u-th view.

Definition 3 (Feature Propagation) Given the node feature set X = {x_i}_{i=1}^{N} and the edge weight set W = {w_{i,j}}, where i, j ∈ [1, N] and i ≠ j, the propagated feature of a node v_i is governed by

x̄_i = P({x_j}, {w_{i,j}}; θ_p),

where {x_j} indicates the feature set of the neighbours of v_i, {w_{i,j}} is the weight set of the edges between v_i and its neighbours, P is the feature propagation function and θ_p is the propagation parameter. From the first to the l-th hop, the feature propagation influence decreases.
Definition 5 (Influential Node) The target node v_i can walk l steps to find its influential node v_{i+l}, which is the node with maximum degree (i.e., degree centrality) and/or maximum density (i.e., density centrality) in the view-based subgraph of v_i.

Heterogeneous Affinity Graph Construction
We construct the heterogeneous affinity graph G = (V, E, φ, ψ) from multiple views of images based on feature similarity. An edge in our graph indicates the possibility of two nodes having the same label. The graph G consists of M + 1 homogeneous affinity graphs, which are related by the connections between the original view and each descriptor view, i.e., there are M + 2 edge types in G.
For each node v_i with original feature x_i, its u-th view feature is φ_u(x_i). According to these features, we adapt the instance pivot subgraph (IPS) [37] to build the heterogeneous affinity graph G following the steps below.
Step 1: Feature extraction. In a single-view image dataset, given a node v_i, we utilise M different descriptors (e.g., Gabor [14], HOG [6]) to generate multiple views of v_i - Fig. 2(a). This results in M + 1 views of v_i and different view-based feature vectors, including the original x_i, where the n-th view-based feature vector of v_i is denoted as x_i^n. We note that, for benchmark multi-view image datasets [8,22], standard multiple views of images are already available.
Step 2: Neighbourhood construction. In each view, we utilise h-hop kNN to build its neighbourhood-based subgraph. Let k_t denote the k nearest neighbours at the t-th hop, where t = 1, 2, ..., h. As t increases, the neighbourhood influence towards v_i decreases; hence, the number of connecting nearest neighbours k_t decreases as well. We add graph edges and their weights along with the neighbourhood discovery. For an edge e_{i,j} between v_i and v_j, their distance d_{i,j} is computed using their sparse construction error c_{i,j}, and the weight w_{i,j} between v_i and v_j is then defined in terms of the similarity score s_{ij} (based on the Euclidean distance) between nodes v_i and v_j, with s_{ij} = s_{ji}. Thus, we get the homogeneous affinity graph of each view. To constitute these affinity graphs as a heterogeneous affinity graph G = (V, E, φ, ψ) - Fig. 2(b), we connect each node v_i with its corresponding nodes in the other views, where each edge weight is kept at 1.
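A simplified, single-hop (t = 1) version of this neighbourhood construction can be sketched as below. Note that the Gaussian-kernel weight is an illustrative stand-in: the paper's exact sparse-construction-error weighting is not reproduced here.

```python
import numpy as np

def knn_affinity(X, k=3, sigma=1.0):
    """Build a symmetric kNN affinity graph for one view: connect each node
    to its k nearest neighbours (Euclidean distance) and weight each edge
    with a Gaussian kernel (a stand-in for the paper's weighting)."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d[i])[1:k + 1]:   # skip self at position 0
            W[i, j] = W[j, i] = np.exp(-d[i, j] ** 2 / (2 * sigma ** 2))
    return W
```

Running this per view and linking each node to its counterparts in the other views (with weight 1) yields the heterogeneous affinity graph of Fig. 2(b).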
Step 3: Node density calculation. Based on the constructed heterogeneous affinity graph, we define the density of a node v_i in the graph as

ρ(v_i) = (1 / |N_{k_1}(v_i)|) Σ_{v_j ∈ N_{k_1}(v_i)} x̂_i^⊤ x̂_j,

where N_{k_1}(v_i) is the first-hop k nearest neighbours of the node v_i, and x̂_i, x̂_j are the ℓ2-normalized feature embeddings of nodes v_i and v_j. According to this formula, the density of v_i is equal to the average of its similarity with its neighbours. Higher density nodes are considered more discriminative and are more influential in identifying cluster centres.
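The density of Step 3, i.e. the average similarity between a node's ℓ2-normalized features and those of its first-hop neighbours, can be computed as:

```python
import numpy as np

def node_density(X, neighbours):
    """Density of node i = mean cosine similarity between its l2-normalized
    feature and those of its first-hop nearest neighbours.
    `neighbours[i]` lists the indices of N_{k1}(v_i)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.array([np.mean([Xn[i] @ Xn[j] for j in nbrs])
                     for i, nbrs in enumerate(neighbours)])
```

A node tightly surrounded by same-direction neighbours gets density close to 1, marking it as a candidate cluster centre.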

Local Feature Propagation Encoder
The Local Feature Propagation (LFP) in our technique governs the interaction among the neighbourhood nodes. We aim to learn the propagated feature embedding of node v_i through its neighbourhood, i.e., to find its neighbours with similar features. Here, we conduct feature propagation on each view-based homogeneous affinity graph, then process the results with cross-view contrastive learning - Fig. 2(c). General aggregation strategies like mean-pooling and max-pooling cannot identify whether nodes are important, i.e., mutual first neighbours cannot be emphasised. We employ the following two steps to obtain the embedding of each node v_i from the perspective of its local neighbourhood.
Step 1: Feature propagation. Based on Eq. (2) and the discussion in Section 2.2, we incorporate the influence of the first neighbours in feature propagation. To explain the concept, we take the n-th view-based node m_i^n as an example. Its feature-propagated embedding, which considers feature information and structural information simultaneously by emphasising the importance of mutual first neighbours, is computed as

m̄_i^n = Σ_{v_j ∈ N(v_i)} (w_{i,j} / d_i) (1 + ✶[N_1(v_i)=v_j]) x_j^n, (8)

where ✶[N_1(v_i)=v_j] is the indicator function for the mutual first neighbour enhancement during feature propagation. It governs the weight of the first neighbour's influence towards the target node m_i^n. In Eq. (8), d_i ∈ D, where D is the degree matrix. Other notations follow from Section 2.2. For conciseness, the text below also avoids repeating the explanation of other notations.
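A one-hop sketch of this propagation is given below, with the mutual-first-neighbour indicator implemented as a simple up-weighting. The `boost` factor and the degree normalisation are illustrative assumptions, not the paper's exact Eq. (8).

```python
import numpy as np

def lfp_embed(X, W, boost=1.0):
    """One hop of local feature propagation: each node aggregates its
    neighbours' features, weighted by (edge weight / degree), with mutual
    first neighbours (each is the other's strongest neighbour) up-weighted
    by `boost`."""
    n = len(X)
    first = W.argmax(axis=1)                 # each node's strongest neighbour
    H = np.zeros_like(X)
    for i in range(n):
        deg = W[i].sum() or 1.0
        for j in range(n):
            if W[i, j] == 0:
                continue
            # Up-weight j's contribution if i and j are mutual first neighbours.
            mutual = 1.0 + boost * (first[i] == j and first[j] == i)
            H[i] += mutual * (W[i, j] / deg) * X[j]
    return H
```

Repeating this for each hop (with decaying influence) gives the multi-hop propagated embedding of each view-based node.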
Step 2: Cross-view contrastive learning. After obtaining the feature-propagated embeddings of the nodes in the different views, we feed them to a shared MLP with one hidden layer to prepare them for the contrastive loss. Here, we follow the classic strategy of defining positive and negative samples. That is, different view-based nodes that are generated from the same original node form a positive pair, while others are negative. However, our method aims at multi-view contrastive learning. Therefore, for the propagated n-th view-based node m_i^n and a second view u ≠ n, we define the following contrastive loss for LFP:

L(m_i^n) = -log [ e^{sim(m_i^n, m_i^u)/τ} / ( e^{sim(m_i^n, m_i^u)/τ} + Σ_{v_b ∈ N(v_i)} ✶[i≠b] e^{sim(m_i^n, m_b^u)/τ} ) ],

where N(v_i) indicates the nodes in the v_i-oriented subgraph, and ✶[i≠b] ∈ {0, 1} is an indicator function that equals 1 if i ≠ b. The overall cost objective is obtained by averaging this loss over all nodes and view pairs. Through the cross-view contrastive objective, we optimise the encoder via back-propagation and learn the embedding z_i^LFP of each node v_i.

Influence-aware Feature Propagation Encoder
We also aim to learn the embedding of node v_i under influence-aware feature propagation (IFP) - Fig. 2(d). For the target node v_i, different views contribute differently to its embedding. The embedding can be affected not only by the nearest neighbours, but also by the influential nodes in its neighbourhood in the various views. Thus, we devise the influence-aware feature propagation encoder at the node level and the view level to hierarchically aggregate underlying information through the shortest paths from the target node v_i towards the influential node v_{i+l} (excluding the edge between v_i and v_{i+l}) in the different view-based subgraphs. Taking the n-th view as an example, we define a path from m_i^n to its influential node m_{i+l}^n as p(m_i^n) = {m_i^n, · · · , m_{i+l}^n} (see Def. 6).
Each path is processed via a GCN-based encoder to learn feature propagation. As shown in Fig. 2(d), for each node (as a target node), its corresponding paths in the various views form the input. Given the path p(m_i^n) = {m_i^n, · · · , m_{i+l}^n}, we define the affinity matrix A(m_i^n) ∈ R^{|p(m_i^n)| × |p(m_i^n)|}, with the initial feature matrix denoted as X(m_i^n). In the k-th layer of the GCN, we update the feature matrix as

X^k(m_i^n) = α σ( A(m_i^n) X^{k-1}(m_i^n) W^{k-1} ) + (1 - α) X^{k-1}(m_i^n),

where X^{k-1}(m_i^n) denotes the updated features of the (k-1)-th GCN layer for all nodes on the path of m_i^n, σ is the ReLU activation, α is a learnable parameter that balances the importance of the updated features, and W^{k-1} is the transformation parameter. Through the GCN, we obtain the embedding of m_i^n as h_i^n. Next, we employ a view-level attention mechanism to hierarchically aggregate context information from the other views to the target node v_i. We first compute the importance e_n of the n-th view from h_i^n and the affinity graph A(m_i^n), using learnable parameters W_IFP for IFP and a view-level attention vector B_IFP, and normalise the importances over the views with a softmax to obtain weights β_n. Then, we compute the final embedding as

z_i^IFP = Σ_{n=0}^{M} β_n h_i^n.
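The view-level aggregation of the per-view embeddings h_i^n can be sketched as below. The parameter shapes (Wv, bv) and the tanh-then-sum scoring are plausible assumptions for illustration; the paper's exact W_IFP/B_IFP parameterisation is not fully specified here.

```python
import numpy as np

def view_attention(H, Wv, bv):
    """Aggregate per-view embeddings H (M_views x d) of one node into a
    single vector via learned view-level attention.
    Wv: (d, d) projection, bv: (d,) bias -- illustrative shapes."""
    scores = np.tanh(H @ Wv + bv).sum(axis=1)      # one importance score per view
    beta = np.exp(scores) / np.exp(scores).sum()   # softmax over views
    return beta @ H                                # weighted sum of view embeddings
```

Because the weights β are a softmax, the output is always a convex combination of the view embeddings; if all views agree, the aggregate reduces to that shared embedding.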

Dual-Context Contrastive Graph Learning
To maximise the mutual information between each pair of embeddings generated from LFP and IFP, we design a dual-context contrastive loss. To that end, we extend the definition of positive samples - Fig. 2(e). That is, for a node v_i, not only are its LFP and IFP embeddings (i.e., z_i^LFP and z_i^IFP) mutually positive, but the embeddings of its mutual first neighbour v_j (i.e., z_j^LFP and z_j^IFP) are also considered its positive samples. We denote the positive sample set of v_i as P_i. This aims to contribute to nailing down the clustering centres. According to the extended positive sample definition, we formulate the dual-context contrastive loss function as

L(z_i^LFP) = -log [ Σ_{v_j ∈ P_i} e^{sim(z_i^LFP, z_j^IFP)/τ} / ( Σ_{v_j ∈ P_i} e^{sim(z_i^LFP, z_j^IFP)/τ} + Neg. ) ], (15)

and L(z_i^IFP) is computed analogously. In (15), Neg. refers to contrasting against the negative samples, defined as

Neg. = Σ_{v_b} ✶[v_b ∉ P_i] e^{sim(z_i^LFP, z_b^IFP)/τ} + Σ_{v_b} ✶[v_b ∉ P_i] e^{sim(z_i^LFP, z_b^LFP)/τ},

where the two indicators separate out the negative samples. The overall contrastive objective of maximising mutual information is then

L = (1/N) Σ_{i=1}^{N} [ θ L(z_i^LFP) + (1 - θ) L(z_i^IFP) ],

where the hyper-parameter θ controls the relative importance of the two embeddings.
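A sketch of the dual-context objective is given below, where `pos_sets[i]` holds the indices of P_i (the node itself and its mutual first neighbours). The denominator here sums over all nodes, a standard InfoNCE simplification rather than the paper's exact two-indicator form.

```python
import numpy as np

def dual_context_loss(z_lfp, z_ifp, pos_sets, tau=0.5, theta=0.5):
    """Dual-context contrastive objective: for each node i, every node in
    pos_sets[i] is a positive across the LFP/IFP embedding pair."""
    def side(za, zb):
        za = za / np.linalg.norm(za, axis=1, keepdims=True)
        zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
        S = np.exp(za @ zb.T / tau)   # exponentiated cross-context similarities
        loss = 0.0
        for i, P in enumerate(pos_sets):
            # Positive mass over P_i vs. total mass (positives + negatives).
            loss += -np.log(S[i, list(P)].sum() / S[i].sum())
        return loss / len(pos_sets)
    # theta weighs the LFP-anchored term against the IFP-anchored term.
    return theta * side(z_lfp, z_ifp) + (1 - theta) * side(z_ifp, z_lfp)
```

Enlarging P_i with mutual first neighbours pulls whole neighbourhood cores together across the two contexts, which is the intended "nailing down" of cluster centres.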

Experimental Setup
Datasets: To establish the effectiveness of our technique, we perform benchmarking of our method on popular standard multi-view image datasets.

Implementation details: [...] 0.9] with a step size of 0.05, and α and θ are both tuned in the range [0.1, 0.9] with a step size of 0.1. Moreover, the encoders conduct aggregation only once, i.e., we use a 2-layer GCN for LFP and IFP. At the end of GoMIC, the images to be clustered are represented as a graph, where each edge is associated with a similarity weight in [0, 1]. To generate the clusters, we visit every node and preserve only the neighbour with the largest edge weight, i.e., the other neighbour nodes are disconnected from the node. Thus, the clusters are formed in an efficient manner.

Results: Table 1 reports the results of several popular multi-view clustering methods and a state-of-the-art self-supervised heterogeneous contrastive graph learning technique, HeCo. The results reveal that multi-view clustering methods which make use of more views usually achieve higher performance. This explains why MIC [25] and DCCAE [33] generally have inferior clustering results: since MIC relies on the best single view to conduct representation learning, and DCCAE correlates view-based graphs for embeddings pair-by-pair, they are not able to perform as well as the other methods. In contrast, more recently proposed methods (MVGL [50], MCGC [50] and RHLC-CAGL [13]) aim to leverage more information from multiple views, which yields better performance. In the table, HeCo [35] deals with natural heterogeneous networks instead of directly dealing with multi-view images. Nevertheless, it performs reasonably well on this benchmark due to the suitability of heterogeneous graphs to the problem. This is in line with our intuition of exploiting the heterogeneous properties of image views for clustering. Therefore, it is not surprising that the performance of our approach, GoMIC, is superior to the above-mentioned strong baselines.
Our method not only contrasts multiple views, but also exploits two newly devised encoding schemes of feature propagation for improved contrastive learning.
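The cluster-formation rule described above (keep only each node's strongest edge, then take the connected components of the pruned graph) can be sketched with a small union-find:

```python
import numpy as np

def extract_clusters(W):
    """Form clusters from an affinity matrix W: each node keeps only the
    edge to its highest-weight neighbour; connected components of the
    resulting graph are the clusters."""
    n = len(W)
    parent = list(range(n))
    def find(x):                         # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        if W[i].max() > 0:
            j = int(W[i].argmax())       # strongest neighbour of node i
            parent[find(i)] = find(j)    # merge i into j's component
    labels = [find(i) for i in range(n)]
    remap = {r: c for c, r in enumerate(dict.fromkeys(labels))}
    return [remap[r] for r in labels]    # relabel components as 0, 1, 2, ...
```

Each node contributes at most one edge, so the pruning and the component pass are both linear in the number of nodes (after the per-row argmax), which is why the clusters form efficiently.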

Ablation Study and Parameter Analysis
GoMIC encodes multi-view based graphs with two innovative encoding schemes, namely LFP and IFP. Also, for better clustering centres, we adjust the contrastive loss function by extending the positive sample definition. To understand the impact of these factors on the overall performance of our technique, we introduce three variants of GoMIC to conduct an ablation study. These variants are respectively denoted as: (1) GoLFP, which contains only LFP as the encoder; (2) GoIFP, which contains only IFP as the encoder; and (3) GoMIC-n/e, which has no extension of the positive sample definition.
The results of these variants on all three datasets are summarized in Table 2.
In the table, we can observe that GoMIC outperforms all these variants by a considerable margin, establishing the benefit of synergising the proposed components. Furthermore, the performances of GoLFP and GoIFP decrease differently when applied to different datasets. This highlights that, in different cases, LFP and IFP contribute differently to the overall performance. The consistent gain of GoMIC over GoMIC-n/e also ascertains the importance of our positive sample definition extension. From the parameter viewpoint, GoMIC has two major hyper-parameters, α and θ. We show the influence of adjusting their values on performance in Fig. 3. The chosen range values are {0.1, 0.3, 0.6, 0.9}.

Conclusions
We introduced an innovative multi-view image clustering approach, GoMIC, which leverages the heterogeneous properties of multi-view image data under contrastive graph learning to understand relationships within image datasets from each node's local neighbourhood and influence-aware context. To extract and exploit more underlying information, we devised two strategies to encode the graphs, Local Feature Propagation (LFP) and Influence-aware Feature Propagation (IFP), to represent each node-based subgraph in two contrasting contexts. Also, we employed two contrastive loss functions and adjusted them to fit the use of LFP and IFP. The first loss function aims to integrate multiple view-based LFP embeddings, and the second nails down the clustering centres with an extended positive sample definition for contrastive graph learning. Experimental results show that our proposed method consistently outperforms the state-of-the-art methods on multi-view clustering benchmarks. Also, our ablation study demonstrates the explicit contribution of each novel aspect of our overall technique. Currently, the framework works under the common assumption of balanced data. In the future, we will extend it to also handle imbalanced data.

Declarations Ethical Approval and Consent to participate
Not applicable

Human and Animal Ethics
Not applicable

Consent for publication
The authors confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. The authors further confirm that the order of authors listed in the manuscript has been approved. The authors confirm that they have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing, the authors confirm that they have followed the regulations of their institutions concerning intellectual property. The authors consent to publication.

Availability of supporting data
The data that support the findings of this study are openly available in COIL-20 [22] at https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php,