GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning

Fang, Uno; Li, Jianxin; Akhtar, Naveed; Li, Man; Jia, Yan

doi:10.1007/s11280-022-01110-6

GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning

Open access
Published: 12 October 2022

Volume 26, pages 1667–1683, (2023)
Cite this article

Download PDF

You have full access to this open access article

World Wide Web Aims and scope Submit manuscript

GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning

Download PDF

Uno Fang¹,
Jianxin Li¹,
Naveed Akhtar²,
Man Li¹ &
…
Yan Jia³

3133 Accesses
18 Citations
1 Altmetric
Explore all metrics

Abstract

Graph learning is being increasingly applied to image clustering to reveal intra-class and inter-class relationships in data. However, existing graph learning-based image clustering focuses on grouping images under a single view, which under-utilises the information provided by the data. To address that, we propose a self-supervised multi-view image clustering technique under contrastive heterogeneous graph learning. Our method computes a heterogeneous affinity graph for multi-view image data. It conducts Local Feature Propagation (LFP) for reasoning over the local neighbourhood of each node and executes an Influence-aware Feature Propagation (IFP) from each node to its influential node for learning the clustering intention. The proposed framework pioneeringly employs two contrastive objectives. The first targets to contrast and fuse multiple views for the overall LFP embedding, and the second maximises the mutual information between LFP and IFP representations. We conduct extensive experiments on the benchmark datasets for the problem, i.e. COIL-20, Caltech7 and CASIA-WebFace. Our evaluation shows that our method outperforms the state-of-the-art methods, including the popular techniques MVGL, MCGC and HeCo.

Self-supervised Multi-view Clustering Framework with Graph Filtering and Contrast Fusion

Consensus representation-driven structured graph learning for multi-view clustering

Article 29 June 2024

Multiview Learning via Non-negative Matrix Factorization for Clustering Applications

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Clustering is typically thought of as a single view problem in computer vision, where an algorithm groups individual data samples based on their overall qualities. These samples, however, may be the outcome of various interpretations or representations of the underlying data. For instance, we can generate different sets of samples as Gabor [1], CLD [2] and HOG [3] descriptors of the images. These representations may hold complementary properties that can be leveraged for improved clustering. This fact has recently piqued interest of the computer vision community, resulting in an emerging topic of multi-view clustering (MVC) [4,5,6,7,8,9,10,11,12].

Another contemporary line of research for image clustering favors graph-based methods [13,14,15,16,17]. The main benefit of graphs for the clustering problem is that they naturally have the capacity to encode data structure information. For instance, methods like [13, 14, 18,19,20,21,22] leverage trained Graph Convolutional Networks (GCNs) for images to reason about the linkage likelihoods between a given node and its neighbours for graph completion, thereby achieving more accurate clusters.

In general, graph-based methods are known to benefit from Contrastive Learning (CL) [23], which induces models using self-supervision. During training, it maximises the agreement between its predictions and the transformed samples of the original sample. For graphs, the analogous Contrastive Graph Learning (CGL) paradigm aims to maximise the prediction agreement on different views of the same underlying graph [4,5,6,7, 24]. These views are created by applying random operations, e.g., adding/deleting nodes/edges and dropping features, to an original graph. In line with the negative sample creation in CL, the CGL considers other original graphs as the negative samples. It learns node-level (intra-view) or graph-level (inter-view) representations - illustrated in Fig. 1(a) - with a graph neural network and a contrastive loss function.

The self-supervised CGL paradigm naturally suites to the multi-view perspective. For instance, [25] and [26] created different graph views and then utilised node-level and graph-level representations for multi-view contrastive learning. These methods consider structural semantics as global information for learning the node-level embeddings, neglecting the fact that each node can also have various features to provide more information. Coming back to our main problem of multi-view image clustering, existing methods generally first compute a data affinity matrix for raw features or learned representation under multiple views, and then perform clustering using the affinity matrix [27,28,29,30,31,32,33,34]. These methods concatenate multiple views to construct a denoised homogeneous graph for image clustering. We provide a simple illustration of a multi-view homogeneous graph for image data in Fig. 1(b), where views are defined using compositional properties. The graph denoising operations, however, can lead to the loss of important semantic information. Additionally, the heterogeneous properties of multi-view data may become meaningless if several views are combined into a homogeneous graph. Theoretically, by treating images as nodes in heterogeneous graphs, it is possible to use more complementary information for multi-view image clustering - Fig. 1(c).

Considering the above narrative, in this work, we propose an inductive Multi-view Image Clustering framework with self-supervised contrastive heterogeneous Graph co-learning (GoMIC). In GoMIC, we maintain the relationships between different views as a heterogeneous affinity graph, while preserving the uniqueness and independence of each view. Our heterogeneous graph consists of several homogeneous affinity graphs - Fig. 1(d). Each node can readily get the local neighbourhood data from each view by creating the heterogeneous affinity graph. To understand the propagation of node features, we created two encoding schemes. In the first, we propagate feature from a node to its neighbourhood in its own and other views through several hops - Fig. 1(e). The second strategy is influence-aware propagation that learns how each node feature propagates towards the densest nodes - Fig. 1(f). Both encoding strategies employ a proposed attention mechanism to control the feature influence of different nodes. Furthermore, to better exploit the learned embeddings under CGL, we explicitly contrast each pair of nodes for mutual information maximisation. We also customise the contrastive loss function to fit our contrastive objective.

Our key contributions are summarised below.

1.
To the best of our knowledge, this is a pioneering multi-view image clustering method using heterogeneous information networks that leverages contrastive graph learning.
2.
We devise two novel heterogeneous information network encoder strategies and an influence attention mechanism to learn the embedding of each node according to its local feature propagation and influence-aware feature propagation, respectively.
3.
We enhance the loss function for contrastive graph learning to consider the mutual first neighbouring nodes as the mutual positive samples.
4.
We conduct comprehensive experiments on three benchmark datasets, not only demonstrating large improvements over the existing self-supervised heterogeneous graph methods, but achieving better results than popular supervised methods across the datasets.

We discuss related work in Section 2. Section 3 demonstrates our proposed framework, GoMIC. In Section 4, we introduce three open multi-view image datasets and evaluate our proposed framework with comparing to 6 state-of-the-art MVC methods. Finally, Section 5 concludes this paper.

2 Prior art & background

We discuss the related work below. This discussion also includes analytical details that are later utilised in discussing the methodology.

2.1 Heterogeneous graph neural network

In recent years, heterogeneous graphs are becoming increasingly popular in neural network research [35, 36]. For instance, [37] studied the use of hierarchical attentions to depict node-level and semantic-level structures in heterogeneous graphs. Similarly, [38] incorporated intermediate nodes of meta-paths in the networks. [39] developed GTN to automatically identify useful graph connections. A technique dubbed MAHINDER is proposed in [40, 41] to employ and encode meta-paths over different views with attention on the importance of attributes and data views. In an unsupervised setting, a heterogeneous graph neural network is proposed in [42], which samples a fixed size of neighbours and fuses their features using LSTMs [43]. [44, 45] focused on network schema and preserved pairwise and network schema proximity simultaneously. [46] devised node- and edge-type dependent parameters to characterise heterogeneous attention over graph edges. The above methods rely strongly on supervised signals of data to encode graphs, whereas graph structures among the nodes are neglected. In [47], a heterogeneous network HeCo is proposed, which generates meta-paths and network schemas and exploits contrastive learning to further use signals of data in a self-supervised manner. [48] and [49] created item clusters and entity clusters to organise the objects and their nearby entities in the knowledge graph. After that, the hierarchically combining the heterogeneity data derived from the clusters with the weights produced by the hierarchical attention layers yields the representations. Nevertheless, encoding graphs and nodes while comprehensively considering node relations and graph structures still remains largely unresolved for the method.

2.2 Feature propagation

Considering that we devise feature propagation scheme in our technique, it is imperative to discuss related research in this direction in more detail. Expanding a node’s feature by propagation is commonly conducted under a generalisation of pagerank equation [50, 51], which can be expressed as

$$\begin{aligned} \widetilde{\varvec{X}}=\varvec{X} \mathcal {W}_{1}+A \widetilde{\varvec{X}} \mathcal {W}_{2}, \end{aligned}$$

(1)

where $\varvec{X}$ contains the original features, A encodes the adjacency, $\mathcal {W}_{1}$ and $\mathcal {W}_{2}$ are coefficient matrices. However, (1) is not naturally invertible. Therefore, [52] modified it with the degree matrix D as follows

$$\begin{aligned} \widetilde{\varvec{X}}=X \mathcal {W}_{1}+D^{-1} A \widetilde{\varvec{X}} \mathcal {W}_{2}. \end{aligned}$$

(2)

The above is convergent if $\mathcal {W}_{2}$ is non-negative along other conditions. Still, this can only allow feature propagation on homogeneous graphs.

[53] attempted to extend feature propagation to heterogeneous graphs. In a heterogeneous graph, they assumed that there are two different sorts of nodes. With a threshold sparsifying the feature similarities, they first created a learnt feature similarity network for each type. Next, they generated the feature propagation graph for each type. Finally, the overall feature graph is obtained via channel attention [39]. Incidentally, there is a huge computational footprint of this method when dealing with a heterogeneous graph with large number of relations. More importantly, this method is more directed to heterogeneous graph sparsification rather than heterogeneous feature propagation.

2.3 Contrastive loss function

Contrastive Graph Learning (CGL) is derived from contrastive learning (CL) [23] for graph learning. CGL has been increasingly researched recently [4,5,6,7, 25] and has achieved excellent performance on graph or node classification by generating and contrasting positive and negative graph view pairs. Here, we organise our review by mainly focusing on contrastive loss function of the related CGL studies, which is helpful in understanding our contribution in Section 3.5. In [4, 7, 25, 54,55,56], the authors adopt the learning objective of CL rather straightforwardly. In doing so, they focused on node-level representations, and neglected the graph-level information. In [5, 6], for any node $v_{i}$, its embedding generated in one view $v^{\prime }_{i}$ and the embedding in the other view $v^{\prime \prime }_{i}$, form a positive pair, whereas embeddings of other nodes are negative samples. The pairwise objective for each positive pair $(v^{\prime }_{i},v^{\prime \prime }_{i})$ is defined as

$$\begin{aligned} \ell \left( v^{\prime }_{i}, v^{\prime \prime }_{i}\right) =-\log \frac{\text {exp}(\text {sim}\left( v^{\prime }_{i}, v^{\prime \prime }_{i}\right) / \tau )}{\underbrace{\text {exp}(\text {sim}\left( v^{\prime }_{i}, v^{\prime \prime }_{i}\right) / \tau )_{\text {positive pair }}+\mathbbm {E}+\mathbbm {A}}} \end{aligned}$$

(3)

where $\text {sim}$ denotes the function computing cosine similarity, $\tau$ is the temperature parameter, $\mathbbm {E}$ identifies contrasting of inter-view negative pairs as $\mathbbm {E} = \displaystyle {\sum _{k=1}^{N} \mathbbm {1}_{\left[ k \ne i\right] } \text {exp}({\text {sim}\left( v^{\prime }_{i}, v^{\prime \prime }_{k}\right) / \tau )}}$, and $\mathbbm {A}$ denotes the contrasting of intra-view negative pairs as $\mathbbm {A} = \displaystyle { \sum _{k=1}^{N} \mathbbm {1}_{[k \ne i]} \text {exp}({\text {sim}\left( v^{\prime }_{i}, v^{\prime }_{k}\right) / \tau )}}$, where $\mathbbm {1}_{[k \ne i]} \in \{0,1\}$ is an indicator function.

Connecting the above back to the heterogeneous graph neural networks, [47] proposed collaboratively contrastive optimization to expand the scope of defining positive samples and used it for self-supervised learning for heterogeneous graphs. However, in this state-of-the-art contrastive objective for heterogeneous graph learning, there is still a lack of consideration on feature propagation in graphs. Also, the frequent use of thresholds in the existing techniques decreases the feasibility of proposed models.

3 Proposed approach

In this section, we discuss the proposed multi-view image clustering with self-supervised contrastive heterogeneous graph co-learning (GoMIC), illustrated in Fig. 2. Our method encodes nodes from the local neighbourhood context and influence-oriented context, which fully captures contrastive structure of the heterogeneous affinity graph. This ensures that our approach well involves clustering boundary nodes (i.e., vague nodes) in the computations. During the encoding, we design innovative attention mechanisms, which learn feature propagation embeddings naturally. Moreover, in our method, a novel contrastive graph learning framework enables embeddings to supervise each other informatively.

3.1 Preliminaries

In order to describe our method for multi-view image clustering with contrastive heterogeneous graph learning, we first formalise the following pertinent notions for better understanding.

Definition 1

(Multi-view Image Data) Given an image dataset $\varvec{V} = \{v_{i}\}_{i=1}^{N}$ and its feature set $\varvec{X} = \{\varvec{x}_{i}\}_{i=1}^{N}$, where N denotes the number of data samples. Each node $v_{i}$ can be represented in multiple views $\{{m_{i}^{n}}\}_{n=1}^{M}$ with multi-view features $\{\varvec{x}_{i}^{n}\}_{n=1}^{M}$, where M is the number of views.

Definition 2

(Heterogeneous Affinity Graph) A heterogeneous affinity graph $G = (V, E, \phi , \psi )$ is constructed from given and multi-view data. V and E are the node and the edge sets of the original $\varvec{X}$, $\phi$ is the view-based node type mapping function, and $\psi$ is the view-based edge type mapping function. We let $\phi _{u}(V)$ denote the node embedding of the u-th view, and $\phi _{0}(V) = V$, i.e., the original node features are the 0-th view. By analogy, $\psi _{0}(E) = E$. Also, we let $\phi _{u}(v_{i}) = {m_{i}^{u}}$ and $\psi _{u}(e_{i}) = {p_{i}^{u}}$, where ${p^{u}_{i}}$ indicates an edge $e_{i}$ in the u-th view.

Definition 3

(Feature Propagation) Given the node feature set $\varvec{X} = \{\varvec{x}_{i}\}_{i=1}^{N}$ and the edge weight set $\varvec{W} = \{w_{i,j}\}$, where $i, j \in [1,N]$ and $i \ne j$, the propagated feature of a node $v_{i}$ is governed by

$$\begin{aligned} \widetilde{\varvec{x}_{i}}=\mathcal {P}(\varvec{x}_{i}, \{\widetilde{\varvec{x}_{j}}\},\{w_{i,j}\}; \theta _{p}), \end{aligned}$$

(4)

where $\widetilde{\varvec{x}_{j}}$ indicates the propagated feature set of $v_{i}$ neighbours, $\{w_{i,j}\}$ is the edge weight set of edges between $v_{i}$ and its neighbours, $\mathcal {P}$ is feature propagation function and $\theta _{p}$ is the propagation parameter.

Definition 4

(Local Propagation Network) The local propagation network $G^{\prime }(v_{i}) = (V^{\prime }(v_{i}), E^{\prime }(v_{i}),\phi , \psi )$ of a node $v_{i}$ is a directed graph, which consists of several local directed subgraphs. A local directed subgraph starts from $v_{i}$ though l hops, each of which finds $k_{l}$ nearest neighbours extracting the neighbourhood of $v_{i}$. Let $v_{a}$ be in the $(h-1)$-th hop and $v_{b}$ in the h-th hop. If $v_{a}$ and $v_{b}$ are the mutual first neighbours, this path stops walking at $v_{b}$, i.e., a path will stop walking when a node in it meets its mutual first neighbour in the next step, or when the path reaches the l-th hop. From the first to l-th hop, the feature propagation influence decreases.

Definition 5

(Influential Node) The target node $v_{i}$ can walk l steps to find its influential node $v_{i+l}$, which identifies a node with maximum degrees (i.e., degree centrality) and/or maximum density (i.e., density centrality) in the view-based subgraph of $v_{i}$.

Definition 6

(Influence-aware Propagation Network) The influence-aware propagation network $G^{\prime \prime }(v_{i}) = (V^{\prime \prime }(v_{i}), E^{\prime \prime }(v_{i}),\phi , \psi )$ of a node $v_{i}$ is a directed graph, which consists of several paths. A path starts from $v_{i}$ to a influential node $v_{i+l}$ through the shortest distance in the M-th view, which is in the form of ${m_{i}^{M}} \longrightarrow m_{i+1}^{M} \longrightarrow \cdots \longrightarrow m_{i+l}^{M}$. From ${m_{i}^{M}}$ to $m_{i+l}^{M}$, the feature propagation influence decreases.

3.2 Heterogeneous affinity graph construction

We construct the heterogeneous affinity graph $G = (V, E, \phi , \psi )$ from multiple views of images based on feature similarity. An edge in our graph indicates the possibility of two nodes having the same label. The graph G consists of $M+1$ homogeneous affinity graphs, which are related by the connections between the original view and each descriptor view, i.e., there are $M+2$ edge types in G. For each node $v_{i}$ and the original feature $\mathbf {x}_{i}$, its u-th view feature is $\phi _{u}\left( \mathbf {x}_{i}\right)$. According to these features, we adjust instance pivot subgraph (IPS) [13] to build the heterogeneous affinity graph G following the steps mentioned below.

Step 1: Feature extraction. In a single-view image dataset, given a node $v_{i}$, we utilise M different descriptors (e.g., Gabor [1], HOG [3]) to generate multiple views of $v_{i}$ - Fig. 2(a). This results in $M+1$ views of $v_{i}$ and different view-based feature vectors, including the original $\varvec{x}_{i}$, where the n-th view-based feature vector of $v_{i}$ is denoted as $\varvec{x}_{i}^{n}$. We note that, for benchmark multi-view image datasets [57, 58], standard multiple views of images are already available.

Step 2: Neighbourhood construction. In each view, we utilise h-hop kNN to build its neighbourhood-based subgraph. Let $k_{t}$ denote the k nearest neighbours at the t-th hop, where $t = 1,2,...,h$. As t increases, the neighbourhood influence towards $v_{i}$ decreases. Hence, the number of connecting nearest neighbours $k_{t}$ decreases as well. We add graph edges and their weights along with the neighbourhood discovery. For instance, for an edge $e_{i,j}$ between $v_{i}$ and $v_{j}$, their distance $d_{i,j}$ is computed using their sparse construction error $c_{i,j}$ as

$$\begin{aligned} d_{i,j}=\left\| \varvec{x}_{i}-c_{i,j} \varvec{x}_{j}\right\| _{2}^{2}. \end{aligned}$$

(5)

Then, the weight between $v_{i}$ and $v_{j}$, i.e. $w_{i,j}$ is defined as

$$\begin{aligned} w_{i,j}= {\left\{ \begin{array}{ll}1 &{} \text{ if } i=j \\ 1-\left( s_{i j}+s_{j i}\right) / 2 &{} \text{ if } i \ne j\end{array}\right. }, \end{aligned}$$

(6)

where $s_{ij}$ is the similarity score (i.e., Euclidean distance) between node $v_{i}$ and $v_{j}$, and $s_{ij} = s_{ji}$. Thus, we get the homogeneous affinity graph of each view. To constitute these affinity graphs as a heterogeneous affinity graph $G = (V, E, \phi , \psi )$ - Fig. 2(b), we connect each node $v_{i}$ with its corresponding other view-based nodes, where each edge weight is kept 1.

Step 3: Node density calculation. Based on the constructed heterogeneous affinity graph, we define the density for a node $v_{i}$ in the graph as

$$\begin{aligned} \rho _{v_{i}}=\frac{1}{\vert \mathcal {N}_{k_{1}}(v_{i})\vert } \sum _{v_{j} \in \mathcal {N}_{k_{1}}(v_{i})} \tilde{\varvec{x}_{i}}^{\intercal } \tilde{\varvec{x}}_{j}, \end{aligned}$$

(7)

where $\mathcal {N}_{k_{1}}(v_{i})$ is the first hop k nearest neighbours of the node $v_{i}$, and $\tilde{\varvec{x}_{i}}$, $\tilde{\varvec{x}_{j}}$ are the $\ell _{2}$-normalised feature embeddings of node $v_{i}$ and $v_{j}$. According to this formula, the density of $v_{i}$ is equal to the average of its similarity with its neighbours. Higher density nodes are considered more discriminative and are more influential in identifying cluster centres.

3.3 Local feature propagation encoder

The Local Feature Propagation (LFP) in our technique governs the interaction among the neighbourhood nodes. We aim to learn the propagated feature embedding of node $v_{i}$ through its neighbourhood, i.e. to find its neighbours with similar features. Here, we conduct feature propagation on each view-based homogeneous affinity graph, then process them with cross-view contrastive learning - Fig. 2(c). General aggregation strategies like mean-pooling and max-pooling cannot identify if nodes are important, i.e., their mutual first neighbours cannot be emphasised. We employ the following two steps to obtain the embedding of each node $v_{i}$ from the sight of local neighbourhood.

Step 1: Feature propagation. Based on (2) and discussion in Section 2.2, we incorporate influence of the first neighbours in feature propagation. To explain the concept, we take the n-th view-based node ${m_{i}^{n}}$ as an example. Its feature propagated embedding, which considers feature information and structural information simultaneously by emphasising the importance of mutual first neighbours, is computed as

$$\begin{aligned} \widetilde{\varvec{X}({m_{i}^{n}})}= & {} \mathcal {W}_{1} \varvec{X}({m_{i}^{n}})+\mathcal {W}_{2} \sum _{{m_{j}^{n}} \in \mathcal {N}({m_{i}^{n}})} \frac{1}{d_{i}} \widetilde{\varvec{X}({m_{j}^{n}})}\cdot \gamma _{[\underset{k=1}{\mathcal {N}}(v_{i})=v_{j}]} \nonumber \\+ & {} \mathcal {W}_{3} \sum _{{m_{j}^{n}} \in \mathcal {N}({m_{i}^{n}})} {\varvec{X}({m_{j}^{n}})} \cdot \gamma _{[\underset{k=1}{\mathcal {N}}(v_{i})=v_{j}]}, \end{aligned}$$

(8)

where $\gamma _{[\underset{k=1}{\mathcal {N}}(v_{i})=v_{j}]}$ is the indicator function for the mutual first neighbour enhancement during feature propagation. It governs the weight of the first neighbour’s influence towards the target node ${m_{i}^{n}}$. In (8) $d_{i} \in D$, where D is the degree matrix. Other notations follow from Section 2.2. For conciseness, the text below also avoids repeating explanation of other notations.

Step 2: Cross-view contrastive learning. After having the feature propagated embeddings of nodes in different M views, we feed them to a shared MLP which has a hidden layer to prepare them for the contrastive loss. In this case, we employ the conventional tactic of designating positive and negative samples. In other words, some view-based nodes that are formed from the same originating node form positive node pairs, whilst others form negative node pairs. However, our method aims at multi-view contrastive learning. Therefore, for the propagated n-th view-based node $\widetilde{{m_{i}^{n}}}$, we define the following contrastive loss for LFP

$$\begin{aligned} \ell (\widetilde{{m_{i}^{n}}}) = -\log \frac{\sum \limits _{k = 0}^{M}\text {exp}(\text {sim}\left( \widetilde{{m_{i}^{n}}}, \widetilde{{m_{i}^{k}}}\right) / \tau )}{\sum \limits _{k=0}^{M}\sum \limits _{v_{b} \in \mathcal {N}(v_{i})}\mathbbm {1}_{[i \ne b]}\text {exp}(\text {sim}\left( \widetilde{{m_{i}^{n}}}, \widetilde{{m_{b}^{k}}}\right) / \tau )}, \end{aligned}$$

(9)

where $\mathcal {N}(v_{i})$ indicates the nodes in the $v_{i}$-oriented subgraph, and $\mathbbm {1}_{[i \ne b]} \in \{0,1\}$ is an indicator function that equals to 1 if $i \ne b$. Then, the overall cost objective is given as

$$\begin{aligned} \mathcal {J}=\frac{1}{M\cdot \vert \mathcal {N}(v_{i})\vert } \sum \limits _{n=0}^{M}\ell (\widetilde{{m_{i}^{n}}}). \end{aligned}$$

(10)

Through the cross-view contrastive objective, we optimise the encoder via back-propagation and learn the embeddings $z_{i}^{\text {LPF}}$ of nodes for $v_{i}$.

3.4 Influence-aware feature propagation encoder

We also aim to learn the embedding of node $v_{i}$ under influence-aware feature propagation (IFP) - Fig. 2(d). For the target node $v_{i}$, different views contributes differently to its embedding. The embedding can not only be affected by the nearest neighbours, but also by the influential nodes in its neighbourhood of various views. Thus, we devise the influence-aware feature propagation encoder at node-level and view-level to hierarchically aggregate underlying information through the shortest paths from the target node $v_{i}$ towards the influential node $v_{i+l}$ (excluding the edge between $v_{i}$ and $v_{i+l}$) in different view-based subgraphs (see Def. 6). Take the n-th view as an example, we define a path from ${m_{i}^{n}}$ to its influential node $m_{i+1}^{n}$ as $p ({m_{i}^{n}}) = \{{m_{i}^{n}},\cdots ,m_{i+l}^{n}\}$ (see Def. 6).

Each path is processed via a GCN-based encoder to learn feature propagation. As shown Fig. 2(d), for each node (as a target node), its corresponding paths in various views will be the input. Having the path $p ({m_{i}^{n}}) = \{{m_{i}^{n}},\cdots ,m_{i+l}^{n}\}$, we define the affinity matrix $\varvec{A}({m_{i}^{n}})\in \mathbb {R}^{\vert p \left( {m_{i}^{n}}\right) \vert \times \vert p ({m_{i}^{n}})\vert }$ with the initial feature matrix denoted as $\varvec{X}({m_{i}^{n}})$. In the k-th layer of GCN, we update the feature matrix as

$$\begin{aligned} \varvec{X}^{k}\left( {m_{i}^{n}}\right)= & {} \sigma (\alpha \cdot \varvec{X}^{k-1}\left( {m_{i}^{n}}\right) \nonumber \\+ & {} (1-\alpha ) \varvec{D}^{-1}\left( {m_{i}^{n}}\right) \varvec{A}\left( {m_{i}^{n}}\right) \varvec{X}^{k-1}\left( {m_{i}^{n}}\right) \mathcal {W}^{k-1}), \end{aligned}$$

(11)

where $\varvec{X}^{k-1}\left( {m_{i}^{n}}\right)$ denotes the updated features of the k-1-th GCN layer for all nodes on the path of $\left( {m_{i}^{n}}\right)$, $\varvec{D}^{-1}\left( {m_{i}^{n}}\right)$ is the diagonal matrix with $\varvec{D}_{i, i}\left( {m_{i}^{n}}\right) =\sum _{j} \varvec{A}_{i, j}\left( {m_{i}^{n}}\right)$, $\sigma$ is the $\text {ReLU}$ activation, $\alpha$ is a learnable parameter that balances the importance of the updated features, and $\mathcal {W}^{k-1}$ is the transformation parameter. Through GCN, we can receive the embedding of ${m_{i}^{n}}$ as ${h_{i}^{n}}$. Next, we employ attention mechanism in view-level to hierarchically aggregate context information from other views to the target node $v_{i}$. We firstly compute the importance of the n-th view as

$$\begin{aligned} \beta _{n}=\frac{\exp \left( \frac{1}{\vert V \vert } \sum _{v_{i} \in V} \varvec{A}_{\text {IFP}}^{\top } \cdot \tanh \left( \mathcal {W}_{\text {IFP}} {h_{i}^{n}}+\mathcal {B}_{\text {IFP}}\right) \right) }{\sum \limits _{j=1}^{M} \exp \left( \frac{1}{\vert V \vert } \sum _{v_{i} \in V} \varvec{A}_{\text {IFP}}^{\top } \cdot \tanh \left( \mathcal {W}_{\text {IFP}} {h_{i}^{j}}+\mathcal {B}_{\text {IFP}}\right) \right) } \end{aligned}$$

(12)

where $\varvec{A}_{\text {IFP}}^{\top }$ indicates the transpose of affinity graph $\varvec{A}({m_{i}^{n}})$, $\tanh (\cdot )$ is the hyperbolic tangent function, $\mathcal {W}_{\text {IFP}}$ are learnable parameters for IFP, and $\mathcal {B}_{\text {IFP}}$ denotes view-level attention. Then, we compute the final embedding as follows

$$\begin{aligned} z_{i}^{\text {IPF}}=\sum _{n=1}^{M} \beta _{n} \cdot {h_{i}^{n}}. \end{aligned}$$

(13)

3.5 Dual-context contrastive graph learning

To maximise the mutual information between each pair of embeddings generated from LFP and IFP, we design a dual-context contrastive loss. To that end, we extend the definition of positive samples - Fig. 2(e). That is, for a node $v_{i}$, not only its LFP and IFP embeddings (i.e., $z_{i}^{\text {LFP}}$ and $z_{i}^{\text {IFP}}$) are mutually positive, but also the embeddings of its mutual first neighbour $v_{j}$ (i.e., $z_{j}^{\text {LFP}}$ and $z_{j}^{\text {IFP}}$) would be considered as its positive samples. We denote the positive sample set of $v_{i}$ as $\mathbb {P}_{i}$. This aims to contribute to nailing down clustering centres. According to the extension of positive sample definition, we formulate the dual-context contrastive loss function as

$$\begin{aligned} \mathfrak {L}(v_{i})=\mathfrak {L}(z_{i}^{\text {LFP}})+\mathfrak {L}(z_{i}^{\text {IFP}}). \end{aligned}$$

(14)

We describe the computation of $\mathfrak {L}(z_{i}^{\text {LFP}})$ below. The $\mathfrak {L}(z_{i}^{\text {IFP}})$ is computed analogously. We let

$$\begin{aligned} \mathfrak {L}(z_{i}^{\text {LFP}}) = -\log \frac{\sum \limits _{z_{j} \in \mathbb {P}_{i}}\text {exp}(\text {sim}\left( z_{i}^{\text {LFP}}, z_{j}\right) / \tau )}{\text {Neg.}}. \end{aligned}$$

(15)

In (15), $\text {Neg.}$ refers to contrasting against negative samples, which is defined as

$$\begin{aligned} \text {Neg.}= & {} \sum \limits _{k=1}^{N}\mathbbm {1}_{[i \ne k]}\mathbbm {1}_{[z_{k}^{\text {LFP}} \notin \mathbb {P}_{i} ]}\text {exp}(\text {sim}\left( z_{i}^{\text {LFP}}, z_{k}^{\text {LFP}}\right) / \tau )\nonumber \\+ & {} \sum \limits _{a=1}^{N}\mathbbm {1}_{[i \ne a]}\mathbbm {1}_{[z_{a}^{\text {LFP}} \notin \mathbb {P}_{i}]}\text {exp}(\text {sim}\left( z_{i}^{\text {LFP}}, z_{a}^{\text {IFP}}\right) / \tau ), \end{aligned}$$

(16)

where $z_{a}^{\text {IFP}}$ represents the IFP embedding of other nodes for contrasting, and two indicators separate out the negative samples. The overall contrastive objective of maximising mutual information is then

$$\begin{aligned} \mathcal {J}=\frac{1}{N} \sum _{i=1}^{N}\left[ \theta \cdot \mathfrak {L}(z_{i}^{\text {LFP}})+(1-\theta )\cdot \mathfrak {L}(z_{i}^{\text {IFP}})\right] , \end{aligned}$$

(17)

where hyper-parameter $\theta$ controls the relative importance of the two embeddings.

4 Evaluation

4.1 Experimental Setup

Datasets: To establish the effectiveness of our technique, we perform bechmarking of our method on popular standard multi-view image datasets.

1.
COIL-20 [57] has 1,440 gray-scale images of 20 classes and each class contains 72 images. Each image has been extracted under 3 views, where the first is a 1024-dimensional intensity feature, the second is a 3304-dimensional LBP feature, and the third is a 6750-dimensional Gabor feature.
2.
Caltech7 [58] contains 1,474 images of 7 classes. Each image has 6 views of features extracted. These views include 48-dimensional Gabor feature, 40-dimensional WM feature, 254-dimensional CENTRIST feature, 1984 dimensional HOG feature, 512-dimensional GIST feature, and 928-dimensional LBP feature.
3.
CASIA-WebFace (CWF) [59] is cleaned to contain 466,169 single-view face images of 10,575 real identities collected from the web. To obtain a relatively large-scale dataset while constraining the image numbers for the available hardware, we select 18 classes of 10,791 images to conduct our experiments. To generate multi-view data, we utilise the 512-dimensional feature vectors extracted by CNN architecture to obtain 3 views (512-dimensional color histogram, 26-dimensional LBP and 512-dimensional Gabor) from each image.

Baseline Methods: For benchmarking, we compare with the following six existing techniques that use a variety of approaches for multi-view image clustering.

1.
MIC [60] adopts the same strategy used by best single view (BSV) [61] to fill missing instances and then learns a non-negative low-dimensional consensus representation for all views. K-means is applied to the learned consensus representation for final clustering.
2.
DCCAE [62] extends the method of combining the deep encoder with Canonical Correlation Analysis by introducing a deep decoder, which has deep canonically correlated auto-encoders coordinating and extracting graph representations in a pairwise manner.
3.
MVGL [27] learns individual graphs and then integrate the multiple learned graphs into a global graph with exactly k components.
4.
MCGC [28] learns a consensus graph by minimising the disagreement between different views and constraining the rank of the Laplacian matrix.
5.
RHLC-CAGL [31] automatically captures a latent common Laplacian that is shared by all views.
6.
HeCo [47] is a self-supervised heterogeneous GNN method, which leverages meta-path to extract two views for contrastive learning.

Implementation Details: In GoMIC, for the single-view image dataset (i.e., CASIA-WebFace), we utilise commonly used feature descriptors, i.e., intensity, LBP and Gabor, to generate multiple views of each image. There are 2 hyperparameters for heterogeneous affinity graph construction in our method, namely, the number of hops $\mathbbm {h}$, and the number of each node’s nearest neighbours in each hop $\{k_{i}\}, i = 1, 2, ..., \mathbbm {h}$. $\mathbbm {h}$ and $\{k_{i}\}$ are defined according to the used dataset. We set $\mathbbm {h} = 4, \{k_{i}\} = \{10, 5, 1, 1\}$ for COIL-20, $\mathbbm {h} = 4, \{k_{i}\} = \{20, 5, 1, 1\}$ for Caltech7, and $\mathbbm {h} = 4, \{k_{i}\} = \{50, 3, 1, 1\}$ for CASIA-WebFace. Also, while setting up the GCN-based contrastive model, we optimise the learning rate in the range [$1e-4$, $5e-3$]. For the dropout function in encoding, we used the range [0.1, 0.5] with a step size 0.05, and $\tau$ is tuned in the range [0.5, 0.9] with a step size 0.05, and $\alpha$ and $\theta$ are both tuned in the range [0.1, 0.9] with a step size 0.1. Moreover, encoders only conduct aggregation once, i.e., we use 2-layer GCN on LFP and IFP. At the end of GoMIC, the images to be clustered are represented as a graph. Each edge in the graph is associated with a similarity weight in [0, 1]. To generate the clusters, we visit every node and only preserve its neighborhood nodes with the largest weight, i.e., the other neighbor nodes are disconnected from the node. Thus, the clusters get formed in an efficient manner.

Table 1 Comparison of clustering performance on three datasets. NMI, PUR and ACC respectively stand for normalised mutual information, purity and accuracy. Self-supervised graph learning-based methods are green highlighted. The best results in each column are bold faced

Full size table

4.2 Comparison with the existing state-of-the-art

Table 1 reports the results of several popular multi-view clustering methods and a state-of-the-art self-supervised heterogeneous contrastive graph learning technique, HeCo. The results reveal that multi-view clustering methods which make use of more views usually achieve higher performance. This explains why MIC [60] and DCCAE [62] generally have inferior clustering results. Since MIC relies on the best single view to conduct representation learning, and DCCAE correlates view-based graphs for embeddings pair-by-pair, they are not able to perform as well as other methods. In contrast, methods proposed more recently (MVGL [28], MCGC [28] and RHLC-CAGL [31]) aim to leverage more information from multiple views, which achieves better performance. In the table, HeCo [47] deals with natural heterogeneous networks instead of directly dealing with multi-view images. Nevertheless, it performs reasonably well on this benchmark due to the suitability of heterogeneous graphs to the problem. This is inline with our intuition of exploiting heterogeneous properties of image views for clustering. It is therefore not surprising that the performance of our method, GoMIC, outperforms these powerful baselines. Our approach not only contrasts numerous views but also makes use of two recently developed feature propagation encoding schemes for enhanced contrastive learning.

Table 2 Ablation study results for GoMIC. GoLFP only uses LFP encoder. GoIFP uses only the IFP encoder and GoMIC-n/e does not use the extended definition of positive samples, but uses both LFP and IFP. Difference of these variants to GoMIC highlights the contribution of the components

Full size table

4.3 Ablation study and parameter analysis

GoMIC encodes multi-view based graphs with two innovative encoding schemes, namely, LFP and IFP. Also, for better clustering centres, we adjust the contrastive loss function by extending the positive sample definition. To understand the impact of these factors in the overall performance of our technique, we introduce three variants of GoMIC to conduct an ablation study. These three variants are respectively denoted as: (1) GoLFP, which contains only LFP as the encoder; (2) GoIFP, which only contains IFP as the encoder and (3) GoMIC-n/e, which has no extension of the positive sample definition. The results of these variants on all three datasets are summarised in Table 2. In the table, we can observe that GoMIC eventually outperforms all these variants by a considerable margin, establishing the benefits of synergising these proposed components. Furthermore, the performances of GoLFP and GoIFP decrease differently when applied to different datasets. This highlights that, for different cases, LFP and IFP are able to contribute differently to the overall performance. A consistent gain of GoMIC over GoMIC-n/e also ascertains the importance of our positive sample definition extension. From the parameters viewpoint, GoMIC has two major hyper-parameters, $\alpha$ and $\theta$. We show the influence of adjusting their values on performance in Fig. 3. The chosen range values are $\{0.1, 0.3, 0.6, 0.9\}$.

5 Conclusions

In order to understand and exploit relationships within image datasets from each node’s local neighbourhood and influence-aware context, we introduced an innovative multi-view image clustering method called GoMIC. GoMIC takes advantage of the heterogeneous properties of multi-view image data under contrastive graph learning. To extract and exploit more underlying information, we devised two strategies to encode the graphs, Local Feature Propagation (LFP) and Influence-aware Feature Propagation (IFP), to represent each node-based subgraph in two contrasting contexts. Also, we employed two contrastive loss functions, and adjusted them to fit the use of LFP and IFP. The first loss function aims to integrate multiple view-based LFP embeddings, and the second nails down the clustering centres with an extended positive sample definition for contrastive graph learning. Experimental results show that our proposed method consistently outperforms the state-of-the-art methods on multi-view clustering benchmarks. Also, our ablation study demonstrates explicit contribution of each novel aspect in our overall technique. Currently, the framework works with the common assumption of balanced data. In the future, we will extend it to also handle imbalanced data.

Authors’ information

Uno Fang is a Ph.D. candidate, and Jianxin Li is an Associate Professor at School of IT, Deakin University. Naveed Akhtar a Senior Lecturer of Machine Learning, AI and Data Science at the Department of Computer Science & Software Engineering at the University of Western Australia. Man Li is pursuing her the Ph.D. degree with School of IT, Deakin University. Yan Jia is a Professor at the Department of Computer Science and Technology, Harbin Institute of Technology.

Data Availability

The data that support the findings of this study are openly available in COIL-20 [57] at https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php, Caltech7 [58] at https://data.caltech.edu/, and CASIA-WebFace [59] at https://mldta.com/dataset/casia-webface/.

References

Lades, M., Vorbruggen, J.C., Buhmann, J., Lange, J., Von Der Malsburg, C., Wurtz, R.P., et al.: Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Comput. 42(3), 300–311 (1993)
Article Google Scholar
Manjunath, B.S., Ohm, J.R., Vasudevan, V.V., Yamada, A.: Color and texture descriptors. IEEE Trans. Circuits Syst. Video Technol. 11(6), 703–715 (2001)
Article Google Scholar
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05). vol. 1. pp. 886–893. Ieee (2005)
Wan, S., Pan, S., Yang, J., Gong, C.: Contrastive and generative graph convolutional networks for graph-based semi-supervised learning. Proceedings of the AAAI Conference on Artificial Intelligence 35(11), 10049–10057 (2021)
Article Google Scholar
Zeng, J., Xie, P.: Contrastive self-supervised learning for graph classification. Proceedings of the AAAI Conference on Artificial Intelligence 35(12), 10824–10832 (2021)
Article Google Scholar
Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., Wang, L.: Graph contrastive learning with adaptive augmentation. In: Proceedings of the Web Conference vol. 2021. p. 2069–2080 (2021)
Hafidi, H., Ghogho, M., Ciblat, P., Swami, A.: Negative sampling strategies for contrastive self-supervised learning of graph representations. Signal Process. 190, 108310 (2022)
Article Google Scholar
Sharma, K.K., Seal, A.: Outlier-robust multi-view clustering for uncertain data. Knowl. Based Syst. 211, 106567 (2021)
Article Google Scholar
Bansal, M., Sharma, D.: A novel multi-view clustering approach via proximity-based factorization targeting structural maintenance and sparsity challenges for text and image categorization. Inf. Process. Manag. 58(4), 102546 (2021)
Article Google Scholar
Ueda, I., Shishido, H., Kitahara, I.: Spatio-temporal aggregation of skeletal motion features for human motion prediction. Array 15, 100212 (2022). https://doi.org/10.1016/j.array.2022.100212
Article Google Scholar
Ntelemis, F., Jin, Y., Thomas, S.A.: Information maximization clustering via multi-view self-labelling. Knowledge-Based Systems 109042 (2022)
Yuan, C., Zhu, Y., Zhong, Z., Zheng, W., Zhu, X.: Robust Self-Tuning Multi-View Clustering. World Wide Web 25(2), 489–512 (2022). https://doi.org/10.1007/s11280-021-00945-9
Article Google Scholar
Wang, Z., Zheng, L., Li, Y., Wang, S.: Linkage based face clustering via graph convolution network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition p. 1117–1125 (2019)
Yang, L., Zhan, X., Chen, D., Yan, J., Loy, C.C., Lin, D.: Learning to cluster faces on an affinity graph. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. p. 2298–2306 (2019)
Li, S., Liu, B., Chen, D., Chu, Q., Yuan, L., Yu, N.: Density-aware graph for deep semi-supervised visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. p. 13400–13409 (2020)
Yang, L., Chen, D., Zhan, X., Zhao, R., Loy, C.C., Lin, D.: Learning to cluster faces via confidence and connectivity estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. p. 13369–13378 (2020)
Yin, H., Song, X., Yang, S., Li, J.: Sentiment analysis and topic modeling for COVID-19 vaccine discussions. World Wide Web 25(05), 1–17 (2022). https://doi.org/10.1007/s11280-022-01029-y
Article Google Scholar
Yang, Y., Guan, Z., Li, J., Zhao, W., Cui, J., Wang, Q.: Interpretable and Efficient Heterogeneous Graph Convolutional Network. IEEE Transactions on Knowledge and Data Engineering p. 1–1 (2021). https://doi.org/10.1109/TKDE.2021.3101356.
Zhang, M., Wang, G., Ren, L., Li, J., Deng, K., Zhang, B.: METoNR: A meta explanation triplet oriented news recommendation model. Knowledge-Based Systems 238, 107922 (2022). https://doi.org/10.1016/j.knosys.2021.107922
Article Google Scholar
Mitra, S., Banerjee, S., Naskar, M.K.: Remodelling correlation: A fault resilient technique of correlation sensitive stochastic designs. Array 15, 100219 (2022). https://doi.org/10.1016/j.array.2022.100219
Article Google Scholar
Myllyaho, L., Nurminen, J.K., Mikkonen, T.: Node co-activations as a means of error detection-Towards fault-tolerant neural networks. Array 15, 100201 (2022). https://doi.org/10.1016/j.array.2022.100201
Article Google Scholar
Song, X., Li, J., Lei, Q., Zhao, W., Chen, Y., Mian, A.: Bi-CLKT: Bi-Graph Contrastive Learning Based Knowledge Tracing. Know-Based Syst 241(C) (2022). https://doi.org/10.1016/j.knosys.2022.108274.
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. p. 1597–1607. PMLR (2020)
Liao, J., Zhao, X., Li, X., Tang, J., Ge, B.: Contrastive Heterogeneous Graphs Learning for Multi-Hop Machine Reading Comprehension. World Wide Web 25(3), 1469–1487 (2022). https://doi.org/10.1007/s11280-021-00980-6
Article Google Scholar
Hassani, K., Khasahmadi, A.H.: Contrastive multi-view representation learning on graphs. In: International Conference on Machine Learning. p. 4116–4126. PMLR (2020)
Wang Y, Min Y, Chen X, Wu J. Multi-view Graph Contrastive Representation Learning for Drug-Drug Interaction Prediction. In: WWW ’21: The Web Conference 2021. ACM / IW3C2. p. 2921–2933 (2021)
Zhan, K., Zhang, C., Guan, J., Wang, J.: Graph learning for multiview clustering. IEEE Trans. Cybern. 48(10), 2887–2895 (2017)
Article Google Scholar
Zhan, K., Nie, F., Wang, J., Yang, Y.: Multiview consensus graph clustering. IEEE Trans. Image Process. 28(3), 1261–1270 (2018)
Article MathSciNet Google Scholar
Xie, Y., Tao, D., Zhang, W., Liu, Y., Zhang, L., Qu, Y.: On unifying multi-view self-representations for clustering by tensor multi-rank minimization. Int. J. Comput. Vis. 126(11), 1157–1179 (2018)
Article MathSciNet MATH Google Scholar
Chen, Y., Wang, S., Peng, C., Hua, Z., Zhou, Y.: Generalized Nonconvex Low-Rank Tensor Approximation for Multi-View Subspace Clustering. IEEE Trans. Image Process. 30, 4022–4035 (2021)
Article MathSciNet Google Scholar
Jing, P., Su, Y., Li, Z., Nie, L.: Learning robust affinity graph representation for multi-view clustering. Inf. Sci. 544, 155–167 (2021)
Article MathSciNet MATH Google Scholar
Liu, J., Teng, S., Fei, L., Zhang, W., Fang, X., Zhang, Z., et al.: A novel consensus learning approach to incomplete multi-view clustering. Pattern Recognit. 115, 107890 (2021)
Article Google Scholar
Shi, S., Nie, F., Wang, R., Li, X.: Multi-View Clustering via Nonnegative and Orthogonal Graph Reconstruction. IEEE Transactions on Neural Networks and Learning Systems (2021)
Xia, W., Wang, Q., Gao, Q., Zhang, X., Gao, X.: Self-supervised Graph Convolutional Network for Multi-view Clustering. IEEE Transactions on Multimedia (2021)
Haldar, N., Li, J., Ali, M., Cai, T., Chen, Y., Sellis, T., et al.: Top-k Socio-Spatial Co-engaged Location Selection for Social Users. IEEE Transactions on Knowledge and Data Engineering. Publisher Copyright: IEEE (2022 ). https://doi.org/10.1109/TKDE.2022.3151095.
Xue, G., Zhong, M., Li, J., Chen, J., Zhai, C., Kong, R.: Dynamic network embedding survey. Neurocomputing 472, 212–223 (2022). https://doi.org/10.1016/j.neucom.2021.03.138
Article Google Scholar
Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., et al.: Heterogeneous graph attention network. In: The World Wide Web Conference. p. 2022–2032 (2019)
Fu, X., Zhang, J., Meng, Z., King, I.: Magnn: Metapath aggregated graph neural network for heterogeneous graph embedding. In: Proceedings of The Web Conference vol. 2020. p. 2331–2341 (2020)
Yun, S., Jeong, M., Kim, R., Kang, J., Kim, H.J.: Graph transformer networks. Adv. Neural Inf. Process. Syst. 32, 11983–11993 (2019)
Google Scholar
Zhong, Q., Liu, Y., Ao, X., Hu, B., Feng, J., Tang, J., et al.: Financial defaulter detection on online credit payment via multi-view attributed heterogeneous information network. In: Proceedings of The Web Conference 2020 p. 785–795 (2020)
Zheng, S., Guan, D., Yuan, W.: Semantic-aware heterogeneous information network embedding with incompatible meta-paths. World Wide Web 25(1), 1–21 (2022)
Article Google Scholar
Zhang, C., Song, D., Huang, C., Swami, A., Chawla, N.V.: Heterogeneous graph neural network. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. p. 793–803 (2019)
Dong, Y., Fu, Y., Wang, L., Chen, Y., Dong, Y., Li, J.: A Sentiment Analysis Method of Capsule Network Based on BiLSTM. IEEE Access 02PP, 1–1 (2020). https://doi.org/10.1109/ACCESS.2020.2973711.
Kong, X., Xia, F., Li, J., Hou, M., Li, M., Xiang, Y.: A shared bus profiling scheme for smart cities based on heterogeneous mobile crowdsourced data. IEEE Transactions on Industrial Informatics 10PP, 1–1 (2019 ). https://doi.org/10.1109/TII.2019.2947063.
Zhao, J., Wang, X., Shi, C., Liu, Z., Ye, Y.: Network schema preserved heterogeneous information network embedding. In: 29th International Joint Conference on Artificial Intelligence (IJCAI) (2020)
Hu, Z., Dong, Y., Wang, K., Sun, Y.: Heterogeneous graph transformer. In: Proceedings of The Web Conference vol. 2020. p. 2704–2710 (2020)
Wang, X., Liu, N., Han, H., Shi, C.: Self-supervised heterogeneous graph neural network with co-contrastive learning. In: KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. p. 1726–1736. ACM (2021)
Wang, J., Shi, Y., Li, D., Zhang, K., Chen, Z., Li, H.: McHa: a multistage clustering-based hierarchical attention model for knowledge graph-aware recommendation. World Wide Web. 25(3), 1103–1127 (2022)
Article Google Scholar
Cai, W., Wang, Y., Mao, S., Zhan, J., Jiang, Y.: Multi-heterogeneous neighborhood-aware for Knowledge Graphs alignment. Inf. Process. Manag. 59(1), 102790 (2022)
Article Google Scholar
Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Stanford InfoLab (1999)
Xiang, B., Liu, Q., Chen, E., Xiong, H., Zheng, Y., Yang, Y.: Pagerank with priors: An influence propagation perspective. In: Twenty-Third International Joint Conference on Artificial Intelligence (2013)
Dornaika, F.: On the use of high-order feature propagation in Graph Convolution Networks with Manifold Regularization. Inf. Sci. 584, 467–478 (2022)
Article Google Scholar
Zhao, J., Wang, X., Shi, C., Hu, B., Song, G., Ye, Y.: Heterogeneous graph structure learning for graph neural networks. In: 35th AAAI Conference on Artificial Intelligence (AAAI) (2021)
Wang, R., Li, L., Tao, X., Dong, X., Wang, P., Liu, P.: Trio-based collaborative multi-view graph clustering with multiple constraints. Inf. Process. Manag. 58(3), 102466 (2021)
Article Google Scholar
Li, J., Zeng, H., Peng, L., Zhu, J., Liu, Z.: Learning to rank method combining multi-head self-attention with conditional generative adversarial nets. Array 15, 100205 (2022). https://doi.org/10.1016/j.array.2022.100205
Article Google Scholar
Zhong, G., Shu, T., Huang, G., Yan, X.: Multi-view spectral clustering by simultaneous consensus graph learning and discretization. Knowledge-Based Systems 235, 107632 (2022)
Article Google Scholar
Nayar, M.H.: Columbia Object Image Library: COIL-100. Department of Computer Science, Columbia University. CUCS-006-96 (1996)
Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In: 2004 conference on computer vision and pattern recognition workshop. p. 178–178. IEEE (2004)
Banerjee, S., Scheirer, W., Bowyer, K., Flynn, P.: On hallucinating context and background pixels from a face mask using multi-scale gans. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. p. 300–309 (2020)
Shao, W., He, L., Philip, S.Y.: Multiple incomplete views clustering via weighted nonnegative matrix factorization with $l_{2, 1}$ regularization. In: Joint European conference on machine learning and knowledge discovery in databases. p. 318–334. Springer (2015)
Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval. p. 267–273 (2003)
Wang, W., Arora, R., Livescu, K., Bilmes, J.: On deep multi-view representation learning. In: International conference on machine learning. p. 1083–1092. PMLR (2015)

Download references

Funding

Open Access funding enabled and organized by CAUL and its Member Institutions

Author information

Authors and Affiliations

School of IT, Deakin University, 221 Burwood Highway, 3125, Burwood, VIC, Australia
Uno Fang, Jianxin Li & Man Li
School of Physics, Mathematics and Computing, The University of Western Australia, 35 Stirling Highway, 6009, Crawley, WA, Australia
Naveed Akhtar
Department of Computer Science and Technology, Harbin Institute of Technology, 518055, Shenzhen, Guangdong, China
Yan Jia

Authors

Uno Fang
View author publications
You can also search for this author in PubMed Google Scholar
Jianxin Li
View author publications
You can also search for this author in PubMed Google Scholar
Naveed Akhtar
View author publications
You can also search for this author in PubMed Google Scholar
Man Li
View author publications
You can also search for this author in PubMed Google Scholar
Yan Jia
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Uno Fang: Conceptualisation of this study, Methodology, Writing - Original Draft, Software. Jianxin Li: Project administration,Supervision, Writing - Review & Editing. Naveed Akhtar: Writing - Review & Editing, Validation. Man Li: Writing - Review & Editing. Yan Jia: Writing - Review & Editing. All authors reviewed the manuscript.

Corresponding author

Correspondence to Jianxin Li.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest. The authors have no relevant financial or non-financial interests to disclose. The authors have no conflicts of interest to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organisation or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article.

Consent for publication

The authors confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. The authors further confirm that the order of authors listed in the manuscript has been approved. The authors confirm that they have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing the authors confirm that they have followed the regulations of our institutions concerning intellectual property. The authors are consent for publication.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Fang, U., Li, J., Akhtar, N. et al. GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning. World Wide Web 26, 1667–1683 (2023). https://doi.org/10.1007/s11280-022-01110-6

Download citation

Received: 05 August 2022
Revised: 09 September 2022
Accepted: 24 September 2022
Published: 12 October 2022
Issue Date: July 2023
DOI: https://doi.org/10.1007/s11280-022-01110-6

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning

Abstract

Similar content being viewed by others

Self-supervised Multi-view Clustering Framework with Graph Filtering and Contrast Fusion

Consensus representation-driven structured graph learning for multi-view clustering

Multiview Learning via Non-negative Matrix Factorization for Clustering Applications

1 Introduction

2 Prior art & background

2.1 Heterogeneous graph neural network

2.2 Feature propagation

2.3 Contrastive loss function

3 Proposed approach

3.1 Preliminaries

Definition 1

Definition 2

Definition 3

Definition 4

Definition 5

Definition 6

3.2 Heterogeneous affinity graph construction

3.3 Local feature propagation encoder

3.4 Influence-aware feature propagation encoder

3.5 Dual-context contrastive graph learning

4 Evaluation

4.1 Experimental Setup

4.2 Comparison with the existing state-of-the-art

4.3 Ablation study and parameter analysis

5 Conclusions

Authors’ information

Data Availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflicts of interest

Consent for publication

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation