Heterogeneous Graph Contrastive Learning with Meta-path Contexts and Weighted Negative Samples

Heterogeneous graph contrastive learning has received wide attention recently. Some existing methods use meta-paths, which are sequences of object types that capture semantic relationships between objects, to construct contrastive views. However, most of them ignore the rich meta-path context information that describes how two objects are connected by meta-paths. On the other hand, they fail to distinguish hard negatives from false negatives, which could adversely affect model performance. To address these problems, we propose MEOW, a heterogeneous graph contrastive learning model that considers both meta-path contexts and weighted negative samples. Specifically, MEOW constructs a coarse view and a fine-grained view for contrast. The former reflects which objects are connected by meta-paths, while the latter uses meta-path contexts and characterizes the details of how the objects are connected. We take node embeddings in the coarse view as anchors, and construct positive and negative samples from the fine-grained view. Further, to distinguish hard negatives from false negatives, we learn weights of negative samples based on node clustering. We also use prototypical contrastive learning to pull close the embeddings of nodes in the same cluster. Finally, we conduct extensive experiments to show the superiority of MEOW against other state-of-the-art methods.


Introduction
Heterogeneous information networks (HINs) are prevalent in the real world, such as social networks, citation networks, and knowledge graphs. In HINs, nodes (objects) are of different types to represent entities, and edges (links) are also of multiple types to characterize various relations between entities. For example, in Facebook, we have entities like users, posts, photos and groups; users can publish posts, upload photos and join groups. Compared with homogeneous graphs, where all nodes and edges are of a single type, HINs contain richer semantics and more complicated structural information. To further enrich the information in HINs, nodes are usually associated with labels. Since object labeling is costly, graph neural networks (GNNs) [16,29,4] have recently been applied to classify nodes in HINs and have been shown to achieve superior performance.
Despite the success, most existing heterogeneous graph neural network (HGNN) models require a large amount of training data, which is difficult to obtain. To address the problem, self-supervised learning, which is in essence unsupervised learning, has been applied to HINs [19,10]. The core idea of self-supervised learning is to extract supervision from the data itself and learn high-quality representations with strong generalizability for downstream tasks. In particular, contrastive learning, one of the main types of self-supervised learning, has recently received significant attention. Contrastive learning aims to construct positive and negative pairs for contrast, following the principle of maximizing the mutual information (MI) [17] between positive pairs while minimizing that between negative pairs. Although some graph contrastive learning methods for HINs have already been proposed [19,11,10], most of them suffer from the following two main challenges: contrastive view construction and negative sample selection.
On the one hand, to construct contrastive views, some methods utilize meta-paths [39,30]. A meta-path, which is a sequence of object types, captures the semantic relation between objects in HINs. For example, if we denote the object types User and Group in Facebook as "U" and "G", respectively, the meta-path User-Group-User (UGU) expresses the co-participation relation. Specifically, two users $u_1$ and $u_2$ are UGU-related if a path instance $u_1 - g - u_2$ exists, where $g$ is a group object and describes the contextual information on how $u_1$ and $u_2$ are connected. Meta-paths can thus identify a set of path-based neighbors that are semantically related to a given object and provide different views for contrast. However, existing contrastive learning methods omit the contextual information in each meta-path view. For example, HeCo [30] takes meta-paths as views, but it only uses the fact that two objects are connected by meta-paths and discards the contexts of how they are semantically connected, which we call meta-path contexts and which can be highly influential in the classification task. For example, a group can provide valuable hints on a user's topic interests. Therefore, contrasting meta-path views with rich contexts is a necessity.
On the other hand, negative sample selection is another challenge to be addressed. Note that most existing graph contrastive learning methods [33,40,5] are formulated in a sampled noise contrastive estimation framework. For each node in a view, random negative sampling from the remaining intra-view and inter-view nodes is widely adopted. However, this could introduce many easy negative samples and false negative samples. Easy negative samples are less informative and easily lead to the vanishing gradient problem [37], while false negative samples adversely affect the learning process by providing incorrect information. Recently, some works [20,37,41] seek to identify hard negative samples to improve the discriminative power of encoders in HINs. Despite their success, most of them fail to distinguish hard negatives from false ones. While ASA [20] is proposed to solve this issue, it is specially designed for the link prediction task and can only generate negative samples for objects based on one type of relation in HINs, which restricts its wide applicability. Since there is no clear-cut boundary between false negatives and hard ones, how to exploit hard negatives while avoiding false ones remains to be investigated.
In this paper, to address the two challenges, we propose MEOW, a heterogeneous graph contrastive learning method with meta-path contexts and weighted negative samples. Based on meta-paths, we construct two novel views for contrast: a coarse view and a fine-grained view. The coarse view expresses that two objects are connected by meta-paths, while the fine-grained view utilizes meta-path contexts and describes how they are connected. In the coarse view, we simply aggregate all the meta-paths and generate node embeddings that are taken as anchors. In the fine-grained view, we construct positive and negative samples for each anchor. Specifically, for each meta-path, we first generate node embeddings based on the meta-path induced graph. To further improve the generalizability of the model, we introduce noise by performing graph perturbations, such as edge masking and feature masking, on the meta-path induced graph to derive an augmented graph, from which we also generate node embeddings. In this way, each meta-path generates two embedding vectors for each node. After that, for each node, we fuse the different embeddings from the various meta-paths to generate its final embedding vector. Then, for each anchor, its embedding vector in the fine-grained view is taken as a positive sample, while those of other nodes are considered negative samples. Moreover, to employ hard negatives and mitigate the adverse influence of false negatives, we learn the importance of negative samples: we perform node clustering and use the results to grade the weights of negative samples. To further boost the model performance, we employ prototypical contrastive learning [15], where the cluster centers, i.e., prototype vectors, are used as positive/negative samples. This helps learn compact embeddings for nodes in the same cluster by pulling nodes close to their corresponding prototype vectors and pushing them away from other prototype vectors. Finally, we summarize our contributions as:
• We propose a novel heterogeneous graph contrastive learning model MEOW, which constructs a coarse view and a fine-grained view for contrast based on meta-paths. The former shows which objects are connected by meta-paths, while the latter employs meta-path contexts and expresses how objects are connected by meta-paths.
• We distinguish hard negatives from false ones by performing node clustering and using the results to grade the weights of negative samples. Based on the clustering results, we also introduce prototypical contrastive learning to help learn compact embeddings of nodes in the same cluster.
• We conduct extensive experiments comparing MEOW with 9 other state-of-the-art methods on node classification and node clustering tasks over three public HIN datasets. Our results show that MEOW achieves better performance than its competitors.
Related Work

Heterogeneous Graph Neural Network
Heterogeneous graph neural networks (HGNNs) have recently received much attention and several models have been proposed. For example, HetGNN [35] aggregates information from neighbors of the same type with a bi-directional LSTM to obtain type-level neighbor representations, and then fuses these neighbor representations with the attention mechanism. HGT [9] designs a Transformer-like attention architecture to calculate mutual attention between different neighbors. HAN [29] employs node-level and semantic-level attention mechanisms to learn the importance of neighbors under each meta-path and the importance of different meta-paths, respectively. To incorporate meta-path context information, MAGNN [4] improves HAN by employing a meta-path instance encoder that includes intermediate semantic nodes. Further, Graph Transformer Networks (GTNs) [32] are capable of generating new graph structures, identifying useful connections between unconnected nodes in the original graph and learning effective node representations on the new graphs. Despite the success, most of these methods are semi-supervised, relying heavily on labeled objects.

Graph Contrastive Learning (GCL)
Contrastive learning aims to construct positive and negative pairs for contrast, with the goal of pulling positive pairs close while pushing negative ones apart. Recently, some works have applied contrastive learning to graphs [6,42]. In particular, most of these approaches use data augmentation to construct contrastive views and adopt three main contrast mechanisms: (1) node-node contrast [21,12,27]; (2) graph-graph contrast [26,33]; (3) node-graph contrast [7,18]. For example, GRACE [40] treats two graphs augmented by node feature masking and edge removal as contrastive views, and then pulls the representations of the same node close while pushing those of other nodes apart. Inspired by SimCLR [2] in the visual domain, GraphCL [31] extends this idea to graph-structured data: it relies on node dropping and edge perturbation to generate two perturbed graphs and then maximizes the mutual information (MI) between the two graph-level representations. Moreover, DGI [28] is the first approach to contrast node-level embeddings with graph-level embeddings, which allows graph encoders to learn both local and global semantic information. In heterogeneous graphs, HeCo [30] takes two views from the network schema and meta-paths to generate node representations and performs contrast between nodes. HDGI [22] extends DGI to HINs and learns high-level node representations by maximizing MI between local and global representations. However, most of these methods select negative samples by random sampling, which introduces false negatives. These samples adversely affect the learning process, so we need to distinguish them from hard negatives.

Hard Negative Sampling
In contrastive learning, easy negative samples are easily distinguished from anchors, while hard negative ones are similar to anchors. Recent studies [23] have shown that contrastive learning can benefit from hard negatives, so some works explore the construction of hard negatives. The most prominent approach is based on mixup [36], a data augmentation strategy that creates convex linear combinations between samples. In the area of computer vision, MoCHi [13] measures the distance between samples by inner product and randomly selects two samples from the N nearest ones to be combined by mixup as synthetic negative samples. Further, CuCo [1] uses cosine similarity to measure the similarity between nodes in homogeneous graphs. In heterogeneous graphs, STENCIL [41] uses meta-path-based Laplacian positional embeddings and personalized PageRank scores to model local structural patterns of the meta-path-induced view. However, these methods either fail to distinguish hard negative samples from false ones or are built on one type of relation in HINs, which restricts their wide applicability.
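For concreteness, the following is a toy PyTorch sketch of mixup-style hard negative synthesis in the spirit of MoCHi; the function name, candidate pool size, and mixing scheme are illustrative assumptions, not the original implementation.

```python
import torch

def mixup_hard_negatives(anchor, negatives, n_hard: int = 64, n_synth: int = 16):
    """Toy sketch of MoCHi-style hard negative mixing: pick the negatives
    closest to the anchor (by inner product), then mix random pairs of them
    into synthetic negatives. Names and defaults are illustrative only."""
    scores = negatives @ anchor                       # inner-product hardness
    hard = negatives[scores.topk(min(n_hard, len(negatives))).indices]
    i = torch.randint(len(hard), (n_synth,))
    j = torch.randint(len(hard), (n_synth,))
    lam = torch.rand(n_synth, 1)                      # mixing coefficients
    return lam * hard[i] + (1 - lam) * hard[j]        # convex combinations
```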
Preliminary

Definition 1. Heterogeneous Information Network (HIN). An HIN is a graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges, each representing a binary relation between two nodes in $V$. Further, $G$ is associated with two mappings: (1) a node type mapping function $\phi: V \to T$ and (2) an edge type mapping function $\psi: E \to R$, where $T$ and $R$ denote the sets of node and edge types, respectively. If $|T| + |R| > 2$, the network $G$ is an HIN; otherwise, it is homogeneous.
Definition 2. Meta-path. A meta-path $P$ is a sequence of node types in the form $T_1 \xrightarrow{R_1} T_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} T_{l+1}$, which describes a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between nodes of types $T_1$ and $T_{l+1}$, where $\circ$ denotes the composition operator on relations. If two nodes $x_i$ and $x_j$ are related by the composite relation $R$, then there exists a path that connects $x_i$ to $x_j$ in $G$, denoted by $p_{x_i x_j}$. Moreover, the sequence of nodes and edges in $p_{x_i x_j}$ matches the sequence of types $T_1, \ldots, T_{l+1}$ and relations $R_1, \ldots, R_l$ according to the node type mapping $\phi$ and the edge type mapping $\psi$, respectively. We say that $p_{x_i x_j}$ is a path instance of $P$, denoted by $p_{x_i x_j} \in P$.
Definition 3. Meta-path Context [16]. Given two objects $x_i$ and $x_j$ that are related by a meta-path $P$, the meta-path context is the set of path instances of $P$ between $x_i$ and $x_j$.

Definition 4. Heterogeneous Graph Contrastive Learning. Given an HIN $G$, our task is to learn node representations by constructing positive and negative pairs for contrast. In this paper, we focus on only one type of nodes, which are considered as target nodes.

Methodology
In this section, we introduce our method MEOW. The general model diagram is shown in Fig. 1. We perform feature transformation and neighbor filtering as preprocessing steps. First, we map the feature vectors of each type of nodes into the same dimension (Step 1) and identify a set of neighbors for nodes based on each meta-path (Step 2). Then, we construct a coarse view by aggregating all meta-paths (Step 3), while constructing a fine-grained view with each meta-path's contextual semantic information (Step 4). After that, we fuse different embeddings from various meta-paths in the fine-grained view through the attention mechanism (Step 5). We take node embeddings in the coarse view as anchors and those in the fine-grained view as the positive and negative samples. To distinguish false negative samples from hard negative samples, we perform clustering and assign weights to the negative samples based on the clustering results (Step 6). Finally, to further boost the model performance, we use prototypical contrastive learning to calculate the contrastive loss and prototypical loss based on the node embedding vectors under the coarse and fine-grained views and the clustering results (Step 7). Next, we describe each component in detail.

Node Feature Transformation
Since an HIN is composed of different types of nodes and each type has its own feature space, we need to first preprocess node features to transform them into the same space. Specifically, for each object $x_i$ of type $T$, we use the type-specific mapping matrix $W_T^{(1)}$ to transform the raw features $x_i$ of the object into:

$$h_i = \sigma\left(W_T^{(1)} \cdot x_i + b_T\right), \tag{4.1}$$

where $h_i \in \mathbb{R}^d$ is the projected initial embedding vector of $x_i$, $\sigma(\cdot)$ is an activation function, and $b_T$ denotes the bias vector.
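As an illustration, the type-specific transformation of Eq. 4.1 can be sketched in PyTorch as follows; the module and parameter names are hypothetical, and ELU is assumed as the activation $\sigma$.

```python
import torch
import torch.nn as nn

class TypeSpecificProjection(nn.Module):
    """Minimal sketch of the type-specific feature transformation (Eq. 4.1):
    one linear map W_T (with bias b_T) per node type T, projecting raw
    features into a shared d-dimensional space."""
    def __init__(self, in_dims: dict, d: int):
        super().__init__()
        self.proj = nn.ModuleDict({t: nn.Linear(dim, d) for t, dim in in_dims.items()})
        self.act = nn.ELU()  # assumed choice of sigma

    def forward(self, feats: dict) -> dict:
        # feats[t]: (num_nodes_of_type_t, in_dims[t]) raw feature matrix
        return {t: self.act(self.proj[t](x)) for t, x in feats.items()}
```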

Neighbor Filtering
Given an object $x$, meta-paths can be used to derive its multi-hop neighbors with specific semantics. When meta-paths are long, the number of neighbors related to $x$ could be numerous. Directly aggregating information from these neighbors to generate $x$'s embedding would be time-consuming. On the other hand, the irrelevant neighbors of $x$ cannot provide useful information for predicting $x$'s label and could adversely affect the quality of the generated embedding of $x$. Therefore, we filter $x$'s meta-path induced neighbors and select those most relevant to $x$. Inspired by [16], we adopt PathSim [25] to measure the similarity between objects. Specifically, given a meta-path $P$, the similarity between two objects $x_i$ and $x_j$ of the same type w.r.t. $P$ is computed by:

$$\text{sim}(x_i, x_j) = \frac{2 \times |\{p_{x_i x_j} : p_{x_i x_j} \in P\}|}{|\{p_{x_i x_i} : p_{x_i x_i} \in P\}| + |\{p_{x_j x_j} : p_{x_j x_j} \in P\}|}, \tag{4.2}$$

where $p_{x_i x_j}$ is a path instance between $x_i$ and $x_j$. Based on the similarities, for each object, we select its top-$K$ neighbors with the largest similarity. The removal of irrelevant neighbors can significantly reduce the number of neighbors for each object, which further improves the model efficiency. After neighbor filtering, the adjacency matrix induced by meta-path $P$ is denoted as $A_P$.
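A minimal NumPy sketch of this PathSim-based filtering is shown below, assuming the commuting matrix $M$ of a meta-path (whose entries count path instances between object pairs) is precomputed; excluding self-loops and symmetrizing the filtered graph are our assumptions.

```python
import numpy as np

def pathsim_topk(M: np.ndarray, K: int) -> np.ndarray:
    """PathSim-based neighbor filtering: M[i, j] is the number of path
    instances of meta-path P between x_i and x_j (the commuting matrix).
    Returns a binary adjacency A_P keeping each node's top-K neighbors."""
    diag = np.diag(M).astype(float)
    sim = 2.0 * M / (diag[:, None] + diag[None, :] + 1e-12)  # Eq. (4.2)
    np.fill_diagonal(sim, -np.inf)          # assumption: exclude self-loops
    A = np.zeros_like(M, dtype=float)
    rows = np.arange(M.shape[0])[:, None]
    A[rows, np.argsort(-sim, axis=1)[:, :K]] = 1.0   # top-K per row
    return np.maximum(A, A.T)               # symmetrize the filtered graph
```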

Coarse View
We next construct the coarse view to describe which objects are connected by meta-paths. Given a set of meta-paths, each meta-path $P$ induces its own adjacency matrix $A_P$. To provide a coarse view on the connectivity between objects, we fuse the meta-path induced adjacency matrices and define

$$\tilde{A} = \frac{1}{m}\left(\tilde{A}_{P_1} + \tilde{A}_{P_2} + \cdots + \tilde{A}_{P_m}\right),$$

where $m$ is the number of meta-paths and $\tilde{A}_{P_u} = D^{-1/2} A_{P_u} D^{-1/2}$ is the normalized adjacency matrix. Here, $D$ is the diagonal degree matrix of $A_{P_u}$, and $\tilde{A} \in \mathbb{R}^{|V| \times |V|}$, where $|V|$ is the number of target nodes. After that, we feed the node embeddings calculated by Equation 4.1 and $\tilde{A}$ into a two-layer GCN encoder to get the representations of nodes in the coarse view. Specifically, for node $x_i$, we get its coarse representation $z_i^c$:

$$z_i^c = \text{Encoder}(\tilde{A}, h_i). \tag{4.3}$$
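The coarse view can be sketched as follows: normalize each meta-path adjacency, average them, and encode with a two-layer GCN (Eq. 4.3). The module layout and dimension choices are assumptions for illustration.

```python
import torch

def normalize_adj(A: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = torch.pow(A.sum(1).clamp(min=1e-12), -0.5)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

class GCNEncoder(torch.nn.Module):
    """Two-layer GCN encoder shared by both views (Eqs. 4.3 and 4.5);
    a sketch with hypothetical names and equal hidden/output widths."""
    def __init__(self, d: int):
        super().__init__()
        self.w1 = torch.nn.Linear(d, d, bias=False)
        self.w2 = torch.nn.Linear(d, d, bias=False)

    def forward(self, A_hat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        H = torch.relu(A_hat @ self.w1(H))
        return A_hat @ self.w2(H)

def coarse_embeddings(adjs: list, H: torch.Tensor, encoder: GCNEncoder):
    # Fuse the m normalized meta-path adjacencies, then encode (Eq. 4.3).
    A_tilde = torch.stack([normalize_adj(A) for A in adjs]).mean(0)
    return encoder(A_tilde, H)
```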

Fine-grained View
The fine-grained view characterizes how two objects are connected by meta-paths, in contrast with the coarse view. Given a meta-path set $PS = \{P_1, \ldots, P_m\}$, for each meta-path $P_u \in PS$, let $P_u = T_0 T_1 \cdots T_l$, which consists of $l+1$ node types. The meta-path links objects of type $T_0$ to those of type $T_l$ via a series of intermediate object types. Since meta-path contexts are composed of path instances and capture the details of how two objects are connected, we utilize meta-path contexts to learn fine-grained representations for objects. However, when $l$ is large, directly handling each path instance as MAGNN [4] does could significantly degrade the model efficiency due to the numerous path instances between two objects, as pointed out in [16]. We instead use objects of the intermediate types of meta-path $P_u$ to leverage the information of meta-path contexts. Specifically, given a meta-path $P_u$ and an object $x_i$ of type $T_0$, we denote $N_i^{T_j}$ as $x_i$'s $j$-hop neighbor set w.r.t. $P_u$. Then we generate $x_i$'s initial fine-grained embedding by aggregating information from all its $j$-hop neighbors with $j \le l$. Formally, we have

$$h_i^{P_u} = \sigma\Big(\sum_{j=1}^{l}\frac{1}{|N_i^{T_j}|}\sum_{x_k \in N_i^{T_j}} W_{uj} \cdot h_k\Big), \tag{4.4}$$

where the learnable parameter matrix $W_{uj}$ corresponds to the $j$-hop neighbors w.r.t. $P_u$. After that, we feed the node embedding $h_i^{P_u}$, which aggregates the meta-path context information, and the adjacency matrix $\tilde{A}_{P_u}$ under the meta-path into a two-layer GCN encoder to generate $x_i$'s fine-grained embedding:

$$z_i^{P_u} = \text{Encoder}(\tilde{A}_{P_u}, h_i^{P_u}). \tag{4.5}$$

Note that the encoder here is the same as that used in the coarse view (see Equation 4.3). Further, to improve the model generalizability, we introduce noise to the meta-path induced graph by performing graph augmentations, such as edge masking and feature masking. After the perturbed graph is generated, we feed it into Equation 4.5 to generate the node embedding $\tilde{z}_i^{P_u}$. In this way, for each meta-path $P_u$ and an object $x_i$, we generate two embeddings $z_i^{P_u}$ and $\tilde{z}_i^{P_u}$. Given the meta-path set $PS = \{P_1, \ldots, P_m\}$, we thus obtain $Z_i = \{z_i^{P_u}, \tilde{z}_i^{P_u} \mid P_u \in PS\}$ for node $x_i$. Finally, we fuse these embeddings by the attention mechanism to generate $x_i$'s fine-grained embedding vector $z_i^f$:

$$w_s = \frac{1}{|V|}\sum_{x_i \in V} q^{\top}\tanh\left(W_{att} \cdot z_i^{(s)} + b_{att}\right), \qquad \beta_s = \frac{\exp(w_s)}{\sum_{s'=1}^{2m}\exp(w_{s'})}, \tag{4.6}$$

$$z_i^f = \sum_{s=1}^{2m}\beta_s \cdot z_i^{(s)}, \tag{4.7}$$

where $z_i^{(s)}$ is the $s$-th embedding in $Z_i$. Here, $V$ is the set of target nodes, $W_{att} \in \mathbb{R}^{d \times d}$ is the weight matrix, $b_{att}$ is the bias vector, $q$ is a learnable attention vector, and $\beta_s$ denotes the attention weight.
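The attention fusion over the $2m$ per-meta-path embeddings might look as follows; the HAN-style scoring with an attention vector $q$ is an assumption consistent with Eqs. 4.6 and 4.7.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Sketch of the attention fusion over the 2m per-meta-path embeddings;
    q, W_att, b_att are assumed (HAN-style) parameters."""
    def __init__(self, d: int):
        super().__init__()
        self.W_att = nn.Linear(d, d)          # W_att and b_att
        self.q = nn.Parameter(torch.randn(d)) # attention vector q

    def forward(self, Z: torch.Tensor) -> torch.Tensor:
        # Z: (S, N, d), where S = 2m embeddings per node
        w = torch.tanh(self.W_att(Z)).mean(dim=1) @ self.q  # (S,), averaged over nodes
        beta = torch.softmax(w, dim=0)                      # attention weights beta_s
        return (beta[:, None, None] * Z).sum(0)             # z^f: (N, d)
```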

The MEOW model
After the coarse view and the fine-grained view are constructed, we perform contrastive learning to learn node embeddings. Before contrast, we use a projection head to map node embedding vectors into the space where the contrastive loss is applied. Specifically, for $x_i$, we have:

$$z_i \leftarrow W^{(2)}\sigma\left(W^{(1)} z_i + b^{(1)}\right) + b^{(2)},$$

where $z_i$ stands for either $z_i^c$ or $z_i^f$, and the projected vectors are still denoted $z_i^c$ and $z_i^f$ for brevity. After that, we take representations in the coarse view as anchors and construct the positive and negative samples from the fine-grained view. For each node $x_i$, we take $z_i^c$ as the anchor, $z_i^f$ as the corresponding positive sample, and all other node representations in the fine-grained view as negative samples. Further, to utilize hard negatives and mitigate the adverse effect of false negatives, we learn the importance of negative samples. In particular, we perform node clustering based on the fine-grained representations $M$ times, where the numbers of clusters are set as $U = \{k_1, k_2, \cdots, k_M\}$. Then, we assign different weights to negative samples of a node based on the clustering results. Intuitively, when the number of clusters is set large, each cluster becomes compact. Compared with hard negatives, false negatives are then more likely to be assigned to the same cluster as the anchor node, while easy negatives are more likely to fall into different clusters. Therefore, we use $\gamma_{ij}$ to denote the weight of node $x_j$ as a negative sample to node $x_i$ and set it as a function $\mathcal{F}$ of the clustering results: $\gamma_{ij} = \mathcal{F}(C_1, C_2, \cdots, C_M)$, where $C_r$ is the $r$-th clustering result. In particular, we can understand $\gamma_{ij}$ as the push strength. For false negatives, $\gamma_{ij}$ should be small to ensure that they are not pushed away from the anchor. For hard negatives, $\gamma_{ij}$ is expected to be much larger, so that the anchor and hard negatives can be discriminated. Since easy samples are distant from the anchor, the model is insensitive to $\gamma_{ij}$ over a wide range of values. Based on $\gamma_{ij}$, we formulate our contrastive loss function as

$$\mathcal{L}_{con}^{i} = -\log\frac{\exp\left(\mathrm{sim}(z_i^c, z_i^f)/\tau\right)}{\exp\left(\mathrm{sim}(z_i^c, z_i^f)/\tau\right) + \sum_{j \neq i}\gamma_{ij}\exp\left(\mathrm{sim}(z_i^c, z_j^f)/\tau\right)}, \tag{4.8}$$

where $\mathrm{sim}(\cdot,\cdot)$ is a similarity function (e.g., cosine similarity) and $\tau$ is a temperature parameter.
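A compact sketch of the weighted contrastive loss of Eq. 4.8, assuming cosine similarity and a precomputed weight matrix $\gamma$ derived from the clustering results:

```python
import torch
import torch.nn.functional as F

def weighted_infonce(z_c, z_f, gamma, tau: float = 0.5):
    """Sketch of the weighted contrastive loss (Eq. 4.8). z_c, z_f: (N, d)
    coarse/fine-grained embeddings; gamma: (N, N) negative-sample weights
    that down-weight likely false negatives and up-weight hard ones."""
    z_c = F.normalize(z_c, dim=1)
    z_f = F.normalize(z_f, dim=1)
    sim = torch.exp(z_c @ z_f.t() / tau)          # (N, N) pairwise similarities
    pos = sim.diag()                              # positives: same node, two views
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    neg = (gamma * sim * mask).sum(dim=1)         # weighted negatives
    return -torch.log(pos / (pos + neg)).mean()
```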
To further make embeddings of nodes in the same cluster more compactly distributed in the latent space, we introduce an additional prototypical contrastive learning loss. In the $r$-th clustering, we consider the prototype vector $c_i^r$, i.e., the center of the cluster containing node $x_i$, as a positive sample and the other prototype vectors as negative samples, and define:

$$\mathcal{L}_{proto}^{i} = -\frac{1}{M}\sum_{r=1}^{M}\log\frac{\exp\left(z_i^c \cdot c_i^r / \theta_i^r\right)}{\sum_{k=1}^{k_r}\exp\left(z_i^c \cdot c_k^r / \theta_k^r\right)}, \tag{4.9}$$

where $\theta_i^r$ is a temperature parameter that represents the concentration estimate of the cluster $C_i^r$ containing node $x_i$. Following [15], we calculate

$$\theta_i^r = \frac{\sum_{q=1}^{Q}\left\|z_q^c - c_i^r\right\|_2}{Q\log(Q+\alpha)},$$

where $Q$ is the number of nodes in the cluster and $\alpha$ is a smoothing parameter that ensures small clusters do not have an overly large $\theta$. Finally, we formulate our objective function $\mathcal{L}$ as:

$$\mathcal{L} = \frac{1}{|V|}\sum_{x_i \in V}\left(\mathcal{L}_{con}^{i} + \lambda\mathcal{L}_{proto}^{i}\right), \tag{4.10}$$

where $V$ is the set of target nodes and $\lambda$ controls the relative importance of the two terms. The loss function can be optimized by stochastic gradient descent. To prevent overfitting, we further regularize all the weight matrices $W$ mentioned above.
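The prototypical term for a single clustering $r$, together with the concentration estimate $\theta$, might be sketched as follows (the full loss averages this over the $M$ clusterings); the function names and the clamping of small $\theta$ values are our assumptions.

```python
import torch
import torch.nn.functional as F

def proto_loss(z_c, assigns, protos, thetas):
    """Prototypical term for one clustering r (inner part of Eq. 4.9).
    z_c: (N, d) embeddings; assigns: (N,) long tensor of cluster ids;
    protos: (k_r, d) cluster centers; thetas: (k_r,) concentrations."""
    logits = (z_c @ protos.t()) / thetas[None, :]   # (N, k_r)
    return F.cross_entropy(logits, assigns)         # -log softmax at own prototype

def concentration(z_c, assigns, protos, alpha: float = 10.0):
    """theta = (sum_q ||z_q - c||_2) / (Q log(Q + alpha)) per cluster,
    following PCL [15]; clamping guards against tiny/empty clusters."""
    thetas = []
    for k in range(protos.size(0)):
        members = z_c[assigns == k]
        Q = max(members.size(0), 1)
        dist = (members - protos[k]).norm(dim=1).sum()
        thetas.append(dist / (Q * torch.log(torch.tensor(Q + alpha))))
    return torch.stack(thetas).clamp(min=1e-3)
```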

Datasets
To evaluate the performance of MEOW, we employ three real-world benchmark HIN datasets: ACM [38], DBLP [4] and AMiner [8]. We next describe the classification task for each dataset.
• ACM: The dataset contains 4019 papers (P), 7167 authors (A), and 60 subjects (S). Links include P-A (an author publishes a paper) and P-S (a paper is based on a subject). We use PAP and PSP as meta-paths. Paper features are the bag-of-words representation of keywords. Our task is to classify papers into three conferences: database, wireless communication, and data mining.
• DBLP: The dataset contains 4057 authors (A), 14328 papers (P), 20 conferences (C) and 7723 terms (T). Links include A-P (an author publishes a paper), P-T (a paper contains a term) and P-C (a paper is published at a conference). We consider the meta-path set {APA, APCPA, APTPA}. Each author is described by a bag-of-words vector of their paper keywords. Our task is to classify authors into four research areas: Database, Data Mining, Artificial Intelligence and Information Retrieval.
• AMiner: The dataset contains 6564 papers (P), 13329 authors (A) and 35890 references (R). Links include P-A (an author publishes a paper) and P-R (a reference for a paper). We consider the metapath set {PAP, PRP}. Our task is to classify papers into four research areas.

Baselines
We compare MEOW with 9 other state-of-the-art methods, which can be grouped into three categories:
• [Methods specially designed for homogeneous graphs]: GraphSAGE [6] aggregates information from a fixed number of neighbors to generate node embeddings. GAE [14] is a generative method that learns representations by reconstructing the adjacency matrix. DGI [28] maximizes the agreement between node representations and a global summary vector.
• [Semi-supervised learning methods in HINs]: HAN [29] learns node representations using node-level and semantic-level attention mechanisms.
• [Unsupervised learning methods in HINs]: HERec [24] utilizes the skip-gram model on each meta-path induced graph to learn node embeddings. HetGNN [34] aggregates information from different types of neighbors based on random walk with restart. DMGI [18] performs contrastive learning between the original network and a corrupted network on each meta-path and adds a consensus regularization to fuse node embeddings from different meta-paths. Mp2vec [3] generates node embedding vectors by performing meta-path-based random walks. HeCo [30] constructs two views with meta-paths and the network schema to perform contrastive learning across them. In particular, HeCo is the state-of-the-art heterogeneous graph contrastive learning model.

Experimental Setup
We implement MEOW with PyTorch and adopt the Adam optimizer to train the model. We fine-tune the learning rate from {5e-4, 6e-4, 7e-4}, the penalty weight on the $l_2$-norm regularizer from {0, 1e-4, 1e-3}, and the patience for early stopping from 10 to 40 with step size 5, i.e., we stop training if the total loss does not decrease for patience consecutive epochs. We set the dropout rate from 0.0 to 0.9, and the temperature $\tau$ in Eq. 4.8 from 0.1 to 1.0, both with step size 0.1. We set $K$ in the neighbor filtering based on the average number of connections of all the objects under each meta-path.

Node Classification

We evaluate node classification in terms of Macro-F1, Micro-F1 and AUC. Note that the class distribution is imbalanced: the largest class contains ∼7 times more objects than the smallest one. It is well known that when labeled objects are imbalanced, AUC is a more accurate metric than the other two. This further verifies that MEOW is effective.

Node Clustering
We further perform K-means clustering to verify the quality of the learned node embeddings. We adopt normalized mutual information (NMI) and adjusted rand index (ARI) as evaluation metrics; for both, the larger, the better. The results are reported in Table 2. As we can see, MEOW significantly outperforms all other methods. In particular, on the ACM dataset, MEOW obtains about 16% improvement in NMI and 25% improvement in ARI over the runner-up method, demonstrating the superiority of our model. This is because prototypical contrastive learning drives node representations in the same cluster to be more compact, which helps boost node clustering.

Ablation Study
We conduct an ablation study on MEOW to understand the characteristics of its main components. To show the importance of the prototypical contrastive learning regularization, we train the model with $\mathcal{L}_{con}$ only and call this variant MEOW-wp (without prototypical contrastive learning). To demonstrate the importance of distinguishing hard negatives from false ones, another variant does not learn the weights of negative samples; we call it MEOW-ww (without weights). Moreover, we update node embeddings by aggregating information without considering meta-path contexts in the fine-grained view and call this variant MEOW-nc (no contexts). This helps us understand the importance of including meta-path contexts in heterogeneous graph contrastive learning. We report the results with 40 labeled nodes per class, as shown in Fig. 2. From these figures, MEOW achieves better performance than MEOW-wp. This is because prototypical contrastive learning drives nodes of the same label to be more compact in the latent space, which leads to better classification results. MEOW also outperforms MEOW-ww on all three datasets, which further demonstrates the advantage of weighted negative samples. In addition, MEOW beats MEOW-nc in all cases. This shows that when using meta-paths, the inclusion of meta-path contexts is essential for effective heterogeneous graph contrastive learning.

Conclusion
We studied graph contrastive learning in HINs and proposed the MEOW model, which considers both meta-path contexts and weighted negative samples. Specifically, MEOW constructs a coarse view and a fine-grained view for contrast. In the coarse view, we took node embeddings derived by directly aggregating all the meta-paths as anchors, while in the fine-grained view, we utilized meta-path contexts and constructed positive and negative samples for the anchors. To improve the model performance, we distinguished hard negatives from false ones by performing node clustering and using the results to assign weights to negative samples. Further, we introduced prototypical contrastive learning, which helps learn compact embeddings of nodes in the same cluster. Finally, we conducted extensive experiments to show the superiority of MEOW against other SOTA methods.