Heterogeneous graph embedding with single-level aggregation and infomax encoding

There has been an increasing interest in developing embedding methods for heterogeneous graph-structured data. The state-of-the-art approaches often adopt a bi-level aggregation scheme, where the first level aggregates information of neighbors belonging to the same type or group, and the second level employs the averaging or attention mechanism to aggregate the outputs of the first level. We find that bi-level aggregation may suffer from a down-weighting issue and overlook individual node information, especially when there is an imbalance in the number of different typed relations. We develop a new simple yet effective single-level aggregation scheme with infomax encoding, named HIME, for unsupervised heterogeneous graph embedding. Our single-level aggregation scheme performs relation-specific transformation to obtain homogeneous embeddings before aggregating information from multiple typed neighbors. Thus, it emphasizes each neighbor’s equal contribution and does not suffer from the down-weighting issue. Extensive experiments demonstrate that HIME consistently outperforms the state-of-the-art approaches in link prediction, node classification, and node clustering tasks.


Introduction
Editors: Annalisa Appice, Grigorios Tsoumakas.

Graph-structured data appear in many applications, including scientific discovery, social network analysis, web searching (Inokuchi et al., 2003), recommender systems, and geographical data (Miller & Han, 2001). Many techniques have been successfully developed to exploit both the information captured by the graph structure and the features of nodes and edges. Most notably, neural network approaches (Kipf & Welling, 2016; Hamilton et al., 2017; Veličković et al., 2017) and network embedding approaches (Perozzi et al., 2014; Tang et al., 2015b; Grover & Leskovec, 2016; Wang et al., 2016; Cao et al., 2015; Goyal & Ferrara, 2018) have continuously set the state of the art in a wide range of problems such as node classification, graph classification, and link prediction. However, these methods often share a common homogeneity assumption. Thus, they are not suitable for graph data with various node types and edge types.
In real-world applications such as recommender systems or search engines, the graph data usually contain multiple types of objects (nodes) and relations (edges). Although it is reasonable to use a homogeneous graph learning model with node types and edge types (relations) encoded into node attributes, doing so would compromise the smoothness assumption inherent in many graph neural networks (NT & Maehara, 2019). In fact, node types (and also edge types) are discrete values and often do not share the same structure as node features. Therefore, heterogeneous graphs, also known as heterogeneous information networks (HINs) (Sun & Han, 2013), can capture data in real-world applications more faithfully than homogeneous graphs.
HINs are designed to capture rich semantics and comprehensive information. They are useful for various data mining tasks, such as similarity search (Sun et al., 2011), recommendation (Liu et al., 2014), clustering (Sun et al., 2012), and classification (Kong et al., 2012). The heterogeneity of HINs poses a challenge in graph mining, namely how to learn information from multiple types of nodes and edges. Since the state of the art for homogeneous graph representation learning is neural-based graph embedding methods (Perozzi et al., 2014; Tang et al., 2015b; Grover & Leskovec, 2016; Wang et al., 2016; Cao et al., 2015; Kipf & Welling, 2016; Veličković et al., 2017), it is natural to extend these methods to HINs. Early works (Tang et al., 2015a; Dong et al., 2017) support multiple node types and relations, but their node embeddings do not consider target relations. To address this problem, some subsequent works (Ty et al., 2017; Shi et al., 2018b) implicitly consider the target relation as edges or meta-path vectors. However, they do not consider neighbor information, which is as crucial to high performance as it is in homogeneous graphs (Kipf & Welling, 2016; Hamilton et al., 2017).
Most recently, graph neural networks designed for HINs extend ideas from the homogeneous data literature to efficiently solve problems of heterogeneous data, setting state-of-the-art results. In general, these neural networks use an approach called bi-level aggregation (Fig. 1c) in order to learn the heterogeneous node embeddings. The first level aggregates information of neighbors with the same node types, relations, or meta-paths. Then, the second level employs the averaging or attention mechanism to aggregate the outputs of the first level. Notable bi-level aggregations include RGCN (Schlichtkrull et al., 2018), HAN (Wang et al., 2019), GATNE (Cen et al., 2019), and HGT (Hu et al., 2020b).
In this paper, we find that the bi-level approach may overlook individual node information, especially when there is an imbalance in the number of different typed relations. Take Fig. 1c as an example. If there are many more user interactions than also-buy or also-view relations, the bi-level aggregation scheme tends to down-weight the information from an individual user. This harms its performance for HIN embedding.
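The down-weighting effect can be made concrete by counting the weight each neighbor receives. The following is a minimal sketch, assuming the simplest bi-level scheme (a mean within each relation followed by a mean across relations); attention-based second levels would give different numbers. The function names and the neighbor counts are hypothetical and chosen only for illustration.

```python
from collections import Counter

def bilevel_weights(relations):
    """Weight of each neighbor under mean-within-relation, then mean-across-relations."""
    counts = Counter(relations)
    num_relations = len(counts)
    return [1.0 / (num_relations * counts[r]) for r in relations]

def single_level_weights(relations):
    """Weight of each neighbor under one mean over all (transformed) neighbors."""
    return [1.0 / len(relations)] * len(relations)

# Hypothetical item node: 98 user interactions, 1 also-buy, 1 also-view neighbor.
neighbors = ["user"] * 98 + ["also_buy", "also_view"]
bi = bilevel_weights(neighbors)
single = single_level_weights(neighbors)

print(f"bi-level:     user={bi[0]:.4f}  also_buy={bi[98]:.4f}")
print(f"single-level: user={single[0]:.4f}  also_buy={single[98]:.4f}")
```

Under these assumptions an individual user contributes 98 times less than the single also-buy neighbor in the bi-level scheme, whereas every neighbor has the same weight in the single-level scheme.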
To further investigate this problem, we create a toy HIN, which consists of three types of nodes, namely "user", "item", and "tag", and two relations, namely "user-item" (U-I) and "item-tag" (I-T). To simulate the preferences of a user and the features of an item, we randomly assign four and three tags to each user and item, respectively. Then, we connect an item and a tag if the item is associated with the tag; we add a link between a user and an item if they have two associated tags in common with more than 25% probability. As a result, the graph contains 1,000 users, 100 items, and 10 tags, with 8,025 U-I edges and 300 I-T edges, representing majority and minority relations, respectively. We generate 10 attributes of nodes according to their associated tags and another 20 non-related noisy attributes according to a binary distribution. Table 1 shows the link prediction results of various methods (the experimental settings are described in the experiment section). Surprisingly, we find that bi-level aggregations (RGCN, HAN) are even inferior to traditional aggregations that do not consider heterogeneity (GraphSAGE (Hamilton et al., 2017) and GAT (Veličković et al., 2017)), especially for the I-T relation. The reason is that bi-level aggregations down-weight the U-I relation and overfit to the noisy features. More evidence, explanations, and results on the down-weighting issue are provided in the experiment section and in Sect. 4.3, Bi-level vs Single-level Aggregation.
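The construction of the toy HIN can be sketched as follows. This is one possible reading of the description, not the authors' generator: it assumes a user and an item are linked with probability 0.25 whenever they share at least two tags, and it omits the node attributes. Under this reading the expected number of U-I edges is about 8,300, close to but not matching the reported 8,025, so the exact procedure may differ. The function name, parameter names, and seed are hypothetical.

```python
import random

def make_toy_hin(n_users=1000, n_items=100, n_tags=10,
                 tags_per_user=4, tags_per_item=3,
                 min_common=2, link_prob=0.25, seed=0):
    """Generate tag assignments plus U-I and I-T edge lists (assumed reading)."""
    rng = random.Random(seed)
    user_tags = [frozenset(rng.sample(range(n_tags), tags_per_user))
                 for _ in range(n_users)]
    item_tags = [frozenset(rng.sample(range(n_tags), tags_per_item))
                 for _ in range(n_items)]
    # Every item is connected to each of its associated tags.
    it_edges = [(i, t) for i, tags in enumerate(item_tags) for t in sorted(tags)]
    # Assumption: link with probability link_prob if enough tags are shared.
    ui_edges = []
    for u, ut in enumerate(user_tags):
        for i, it in enumerate(item_tags):
            if len(ut & it) >= min_common and rng.random() < link_prob:
                ui_edges.append((u, i))
    return user_tags, item_tags, ui_edges, it_edges

user_tags, item_tags, ui_edges, it_edges = make_toy_hin()
print(len(ui_edges), "U-I edges;", len(it_edges), "I-T edges")
```

The I-T relation always has exactly 100 × 3 = 300 edges, which matches the minority relation described above, while the U-I count depends on the random draws.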
In this paper, we propose a simple yet effective single-level aggregation scheme with infomax encoding, named HIME, for unsupervised HIN embedding (Fig. 1b). The key point is that we perform relation-specific transformation to obtain homogeneous embeddings before aggregating information from multiple typed neighbors. Thus, we emphasize the "equal" contribution of each neighbor and do not suffer from the down-weighting issue when the numbers of multiple typed relations are imbalanced. Our final embeddings are learned by a loss that encourages closeness between neighbors and an infomax encoder (Fig. 2) that augments graph smoothing in the homogeneous embeddings. As shown in Table 1, HIME outperforms the bi-level aggregation approaches in the toy HIN, by an especially large margin for the I-T relation. We conduct extensive experiments using ten benchmark datasets and demonstrate that HIME consistently outperforms the state-of-the-art approaches in link prediction, node classification, and node clustering tasks. We show that HIME is scalable and is able to deal with a large HIN containing 12.8 million edges.

Contributions. We make the following contributions: (1) To the best of our knowledge, we are the first to raise the down-weighting issue of bi-level aggregations, which hinders their effectiveness for HIN embedding, and to provide concrete empirical evidence. (2) We introduce the heterogeneous single-level aggregation scheme with infomax encoding, a simple yet effective HIN embedding method. (3) We show that our implementation outperforms the latest HIN embedding models in many practical tasks.

Preliminaries
Definition 1 (Heterogeneous Information Network) A Heterogeneous Information Network (HIN) is a tuple $(G, T, \phi, R)$, where $G = (V, E)$ is an undirected graph with node set $V$ and edge set $E \subseteq V \times R \times V$, and $R$ is a set of relations. $T$ is the set of node types, and the function $\phi : V \mapsto T$ maps a node to a single type. Optionally, some datasets include node attributes $\mathbf{x} : V \mapsto \mathbb{R}^p$, where $p$ is the number of dimensions of the attributes. Note that HINs require $|T| > 1$ or $|R| > 1$.

Fig. 2 Architecture of Heterogeneous Graph Infomax Encoding. Following Figure 1, nodes and relations on the left are neighbors with correct relations (positive) to the center node, while nodes and relations on the right are randomly sampled (negative)

Problem 1 (Relation-aware embedding for HIN) Given a HIN and a target relation $t \in R$, the problem is to generate low-dimensional representation vectors $\mathbf{v}^t_i \in \mathbb{R}^d$ for each node $v_i \in V$ according to the relation $t$, with $d \ll |V|$, such that the representations preserve the structure of the given HIN.
The problem defined above is common in unsupervised HIN embedding. To embed a HIN comprehensively, we do not generate embeddings for one specific relation only, but solve the relation-aware embedding problem across all relations in the given HIN. The main challenge is maintaining a high-quality representation across relations while dealing with the scalability issue. In the following, we first discuss related work on graph embedding and HIN embedding.

Graph embedding
Recently, homogeneous graph embedding methods (Perozzi et al., 2014; Tang et al., 2015b; Grover & Leskovec, 2016; Wang et al., 2016; Cao et al., 2015) have emerged as scalable ways to learn low-dimensional vector representations for graph nodes. These node representations encode semantic information transcribed from the graph and can be used directly as node features for downstream machine learning tasks. Traditionally, node representations can be obtained by performing the eigendecomposition of the Laplacian matrix (Ng et al., 2002). However, due to the complexity of this operation, other heuristics were proposed. The most notable is DeepWalk (Perozzi et al., 2014), which generates a series of random walks to capture the graph semantics and learns node representations using a skip-gram model (Mikolov et al., 2013). Applications of the skip-gram model can also be seen in recommender systems (Grbovic et al., 2015; Bianchi et al., 2020), context prediction (Lazaridou et al., 2015), etc. Subsequent models to DeepWalk include LINE (Tang et al., 2015b) and node2vec (Grover & Leskovec, 2016). Certain works (Wu et al., 2018; Yang et al., 2019) introduce hashing techniques to reduce the training time and improve the scalability. For a more comprehensive view, we refer to the survey article (Goyal & Ferrara, 2018) on graph embedding.
More recently, researchers proposed Graph Neural Networks (GNNs) as a new class of graph embedding models (Hamilton et al., 2017; Kipf & Welling, 2016; Veličković et al., 2017). While not learning the node embeddings explicitly, these models implicitly learn them by combining node attributes and graph structures in neural-based models. We refer to the graph neural network survey article (Wu et al., 2019) for more details on this literature.
Since real-world data such as the Amazon data (Fig. 1a) are intrinsically heterogeneous, it is not trivial to extend graph embedding methods to work with heterogeneous data. Many HIN embedding models have been successful in bridging this gap.

HIN embedding
In early works on mining HINs, many methods (Sun et al., 2011; Liu et al., 2014) use meta-paths as semantic information to underline the difference between HINs and homogeneous networks. As random-walk based models became popular, there were many attempts to combine the concept of meta-paths with skip-gram models. The notable works are metapath2vec and HIN2vec. metapath2vec (Dong et al., 2017) formalizes meta-path-based random walks and then utilizes heterogeneous skip-gram models. HIN2vec considers a neural network model for capturing and differentiating the information of meta-paths. In the meantime, another line of work (Xu et al., 2017; Tang et al., 2015a; Shi et al., 2018a) treats the graph under each relation as a subgraph and then jointly learns from those subgraphs.
Researchers have recognized the problem of incompatibility of heterogeneity in relations (Shi et al., 2018b; Chen et al., 2018). To alleviate this problem, they propose relation-specific projection or implicitly consider it as relation embedding. With the success of GNNs, many works follow these architectures and adapt them to HINs. HAN (Wang et al., 2019), GATNE (Cen et al., 2019), and GTN (Yun et al., 2019) utilize an attention mechanism after aggregating on the subgraph level, with subgraphs separated by node types, relations, or meta-paths. We refer to these methods as bi-level aggregations. However, none of these methods considers the contribution of each individual neighbor to the final node representations.

Proposed method
We present a simple yet effective relation-aware heterogeneous graph embedding algorithm called Heterogeneous graph InfoMax Encoder (HIME), which keeps the awareness of relations in heterogeneous graphs without suffering from the down-weighting issue. We first describe our encoder for learning node representations, and then we introduce a mechanism to maximize the mutual information inside the encoder.

Relation-aware node embedding
Initially, each node $v_i$ is associated with a feature vector $\mathbf{v}^{(0)}_i \in \mathbb{R}^d$, which is shared across relations. In case node attributes are available, we project them to a $d$-dimensional space by $\mathbf{v}^{(0)}_i = \mathbf{W}_a \mathbf{x}_i$, where $\mathbf{W}_a \in \mathbb{R}^{d \times p}$ is a learnable weight matrix for the projection and $\mathbf{x}_i \in \mathbb{R}^p$ denotes the attributes of $v_i$. To get the final representation of node $v_i$ for relation $t$, we combine the information of node $v_i$, its neighbors $N_i$, and relation $t$. The process is carried out by a single one or a stack of our proposed heterogeneous single-level aggregators.

Heterogeneous single-level aggregator
As shown in Fig. 1b, given an object of interest $v_i$, we transform its neighbor features according to their relations to $v_i$: $\mathbf{e}_{j,r} = h(\mathbf{v}_j, r)$, where $v_j$ is a node connected to $v_i$ by relation $r$; $h$ allows a node to pass distinct features for different relations. The transformation can be a hyperplane translation (Wang et al., 2014) or a linear transformation. We call $\mathbf{e}_{\cdot,\cdot}$ an edge vector. Then we aggregate edge vectors using the mean function: $\mathbf{n}_i = \frac{1}{|N_i|} \sum_{(v_j, r) \in N_i} \mathbf{e}_{j,r}$. Notice that we perform the neighbor relation-specific transformation before the aggregation to emphasize each neighbor's contribution rather than diluting the node's information by fusing it with neighbors of the same relation. Furthermore, we found that the transformation should be carefully selected. For example, the hyperplane translation assumes that all edge vectors after the transformation still belong to the same feature space but lie in different hyperplanes. This can be beneficial for certain heterogeneous graphs, where the information between relations is highly inclusive so that knowing other relations can help to inform the structure of a target relation.
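The transform-then-average step can be sketched in a few lines. This is a minimal illustration, assuming the linear-transformation variant of h with one weight matrix per relation; the function names, the toy vectors, and the weights are hypothetical and not taken from the authors' implementation.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def single_level_aggregate(neighbors, node_vecs, rel_weights):
    """Mean of edge vectors, each obtained by a relation-specific linear map.

    neighbors:   list of (neighbor_id, relation) pairs of the target node
    node_vecs:   dict neighbor_id -> feature vector
    rel_weights: dict relation -> weight matrix (assumed linear form of h)
    """
    edge_vecs = [matvec(rel_weights[r], node_vecs[j]) for j, r in neighbors]
    dim = len(edge_vecs[0])
    return [sum(e[k] for e in edge_vecs) / len(edge_vecs) for k in range(dim)]

node_vecs = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
rel_weights = {
    "U-I": [[1.0, 0.0], [0.0, 1.0]],  # identity, for readability
    "I-T": [[2.0, 0.0], [0.0, 2.0]],  # scales by two
}
neighbors = [(1, "U-I"), (2, "U-I"), (3, "I-T")]
agg = single_level_aggregate(neighbors, node_vecs, rel_weights)
print(agg)  # [1.0, 1.0]
```

Because the relation-specific map is applied per neighbor before a single mean, each of the three neighbors carries weight 1/3 regardless of how many neighbors share its relation.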
Next, we combine the information of node $v_i$ ($\mathbf{v}_i$) and its neighbors ($\mathbf{n}_i$) by using gated recurrent units (GRU) (Cho et al., 2014): $\mathbf{v}_i \leftarrow \mathrm{GRU}(\mathbf{v}_i, \mathbf{n}_i)$. We treat $\mathbf{v}_i$ as the hidden state of a recurrent model. Alternatively, simplified versions of GRU (Chairatanakul et al., 2019) can be used to reduce the model complexity and possibly improve the performance on sparse datasets. The motivation behind using GRUs is the oversmoothing issue of GNNs, that is, the degradation of the performance of GNNs as the number of layers increases because node representations become similar (Li et al., 2018; Oono & Suzuki, 2019). Note that GRU is commonly used in learning from sequential data where retaining past information is crucial, although it requires heavy computational resources. We hypothesize that GRU can extract new information from the current layer while retaining useful information from the previous layer. Any homogeneous GNN can also be used after obtaining edge vectors by treating them as neighbors in a homogeneous graph. Note that a slight modification of bi-level aggregation can turn it into single-level aggregation. However, doing so contradicts the purpose of the second level, which aims to reassign the weights based on relations.
In the last layer, we employ a distinct GRU for each target relation $t$. The formula can be written as follows: $\mathbf{v}^t_i = \mathrm{GRU}_t(\mathbf{v}_i, \mathbf{n}_i)$. The motivation is to allow the node representation to have both shared information across all relations and the node's unique relational information. This can be seen from the node-vector update of $\mathrm{GRU}_t$, which interpolates element-wise between the shared vector $\mathbf{v}_i$ and the relation-specific candidate vector $\hat{\mathbf{v}}^t_i$, with the update gate vector $\mathbf{z}^t_i$ setting the mixing weights.
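The role of the update gate can be illustrated with a small GRU cell in which the node vector is the hidden state and the aggregated neighbor vector is the input. This is a sketch under assumptions: the gate convention used here (new state = (1 - z) * old state + z * candidate) is one of two common conventions and may differ from the authors' implementation, and all names and parameter shapes are hypothetical. A relation-specific last layer would simply hold one parameter set per relation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, b, x, h):
    """Compute W x + U h + b row by row."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + bias
            for w_row, u_row, bias in zip(W, U, b)]

def gru_cell(params, neighbor_vec, node_vec):
    """One GRU step: node_vec is the hidden state, neighbor_vec the input."""
    z = [sigmoid(a) for a in affine(*params["z"], neighbor_vec, node_vec)]  # update gate
    r = [sigmoid(a) for a in affine(*params["r"], neighbor_vec, node_vec)]  # reset gate
    reset_state = [ri * hi for ri, hi in zip(r, node_vec)]
    cand = [math.tanh(a) for a in affine(*params["c"], neighbor_vec, reset_state)]
    # Assumed convention: the gate z weights the candidate, (1 - z) the old state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, node_vec, cand)]

def zero_params(dim):
    """All-zero parameters, used only to make the example easy to check by hand."""
    zeros = [[0.0] * dim for _ in range(dim)]
    return {k: (zeros, zeros, [0.0] * dim) for k in ("z", "r", "c")}

# Hypothetical per-relation parameter sets for the last layer.
last_layer = {"U-I": zero_params(2), "I-T": zero_params(2)}
out = gru_cell(last_layer["I-T"], [1.0, 1.0], [2.0, -4.0])
print(out)  # [1.0, -2.0]
```

With all-zero parameters the gate equals 0.5 and the candidate is zero, so the output is exactly half of the shared node vector; with learned parameters the gate decides, per dimension, how much shared information is kept.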

Objective function
To preserve the structure of the HIN, we encourage the closeness of embeddings for nodes connected by an edge, while enforcing the separation of embeddings for unconnected nodes. Therefore, we minimize the graph-context loss $L_G$ in Eq. (1), where $\langle \cdot, \cdot \rangle$ denotes the inner product. The loss is derived from Bayesian Personalized Ranking (Rendle et al., 2009). It is commonly used in recommender systems (Ricci et al., 2010) and is similar to the margin-based ranking loss (Bordes et al., 2013). For scalability, we use negative sampling, which is described in Sect. 3.5.
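A loss of this family can be sketched as follows. This is the standard Bayesian Personalized Ranking form on inner products, written here as an assumption about the general shape of the objective rather than as the paper's exact equation; the function names and the averaging over negatives are choices made for this illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bpr_loss(anchor, positive, negatives):
    """Mean of -log sigmoid(<anchor, positive> - <anchor, negative>)."""
    pos_score = dot(anchor, positive)
    losses = [softplus(-(pos_score - dot(anchor, neg))) for neg in negatives]
    return sum(losses) / len(losses)

anchor = [1.0, 0.0]
good = bpr_loss(anchor, [1.0, 0.0], [[0.0, 1.0]])  # positive scores higher
bad = bpr_loss(anchor, [0.0, 1.0], [[1.0, 0.0]])   # negative scores higher
print(round(good, 4), round(bad, 4))
```

The loss is small when the connected node scores higher than the sampled unconnected node and grows when the order is reversed, which is the closeness and separation behavior described above.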

Heterogeneous graph infomax encoding
While the aggregator proposed in the previous section is flexible and powerful in learning local structures, using multiple transformations can result in high heterogeneity. That is, the edge vectors conflict with each other. This impairs graph smoothing, which is an essential characteristic of GNNs (Chen et al., 2020; Xie et al., 2020). We want to avoid this scenario and, at the same time, encourage the model to capture unique local features. Therefore, we aim to maximize the mutual information between the edge vectors and the output of the layer. Similar to DGI (Veličković et al., 2018) and DIM (Hjelm et al., 2018), we use the binary cross-entropy between samples from the joint (positive) and the product of marginals (negative) as the infomax loss in Eq. (2), where $l$ indexes the $l$-th layer and $\tilde{N}_i$ is the set of corrupted neighbors of node $v_i$. By analogy with DIM, the vicinity centered on $v_i$ in the graph is treated as a single image. To increase the mutual information, the encoder needs to consider every neighbor and aggregate the information on which most neighbors agree. Notice that we apply the loss to each layer separately, since this makes mini-batch training easier.
Following DGI, we adopt its discriminator: $D(\mathbf{e}, \mathbf{v}) = \mathbf{e}^{\top} \mathbf{W}_D \mathbf{v}$, where $\mathbf{W}_D$ is a weight matrix of the discriminator. We call the process of optimizing Eq. (2) via a discriminator inside each layer of the single-level aggregator infomax encoding. Note that our motivation for infomax encoding is different from that of the graph infomax in DGI, DMGI (Park et al., 2020), and HDGI (Ren et al., 2019). In those works, the main objective is to preserve the mutual information between local patches and a global graph summary. We refer to them as global infomax.
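The interaction between the bilinear discriminator and the binary cross-entropy can be sketched as follows. This is an illustration under assumptions: the discriminator score is treated as a logit, positives are edge vectors of true neighbors, negatives are edge vectors of corrupted neighbors, and the loss is averaged over all pairs. The exact form of Eq. (2) may differ, and all names are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bilinear(a, W, b):
    """Discriminator score a^T W b."""
    return sum(a[i] * W[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def infomax_bce(pos_edge_vecs, neg_edge_vecs, output_vec, W):
    """Binary cross-entropy with scores used as logits (assumed form)."""
    pos = [softplus(-bilinear(e, W, output_vec)) for e in pos_edge_vecs]  # -log sigmoid(D)
    neg = [softplus(bilinear(e, W, output_vec)) for e in neg_edge_vecs]   # -log(1 - sigmoid(D))
    return (sum(pos) + sum(neg)) / (len(pos) + len(neg))

identity = [[1.0, 0.0], [0.0, 1.0]]
layer_output = [1.0, 0.0]
agree = infomax_bce([[2.0, 0.0]], [[-2.0, 0.0]], layer_output, identity)
conflict = infomax_bce([[-2.0, 0.0]], [[2.0, 0.0]], layer_output, identity)
print(round(agree, 4), round(conflict, 4))
```

The loss is low when true neighbors' edge vectors align with the layer output and corrupted ones do not, which is the sense in which the objective rewards information that most neighbors agree on.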
The advantages of the proposed infomax encoding over global infomax are two-fold. First, infomax encoding is compatible with mini-batch sampling, whereas global infomax needs to compute the embeddings for the whole graph to obtain the global summary. This makes infomax encoding scalable to a large graph without any further modification. Second, it promotes homogeneity between neighbors, or edge vectors, in the graph. To make a comparison, we optimize the model in the previous section via either infomax encoding or global infomax while measuring the homogeneity between neighbors. To measure the homogeneity, we use the cosine similarity between neighbors of the same node and between any random edges, following Chen et al. (2020). Intuitively, we want a high value among the neighbors, which indicates that the neighbors contain similar information, and a low value among the random edges, since a high value there would indicate oversmoothing. The results are plotted in Fig. 3. We can clearly see the large gap between the value from the neighbors and the value from the random edges in Fig. 3b. This indicates that infomax encoding can encourage homogeneity while keeping oversmoothing under control. On the other hand, Fig. 3a shows that global infomax does not deliver the same desirable behavior. More benefits of infomax encoding are presented in Sect. 4.4, Study of Infomax Encoding.

Fig. 3 Average of the cosine similarity of edge vectors between neighbors of the same node (green line) and between any random edges (grey line). The green area shows the gap between them. The larger the gap, the better the homogeneity and the less prone the model is to oversmoothing
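The homogeneity measurement can be sketched as the difference between two average cosine similarities. This is an illustrative reading of the procedure, assuming that similarities are averaged over all pairs within each neighbor group and over all pairs of the randomly chosen edge vectors; the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(vectors):
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def homogeneity_gap(neighbor_groups, random_edge_vecs):
    """Return (neighbor similarity, random similarity, gap)."""
    within = [mean_pairwise_cosine(g) for g in neighbor_groups if len(g) >= 2]
    neighbor_sim = sum(within) / len(within)
    random_sim = mean_pairwise_cosine(random_edge_vecs)
    return neighbor_sim, random_sim, neighbor_sim - random_sim

# Edge vectors grouped by the node they are neighbors of (toy values).
groups = [[[1.0, 0.0], [2.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]]]
random_edges = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
neighbor_sim, random_sim, gap = homogeneity_gap(groups, random_edges)
print(neighbor_sim, random_sim, gap)
```

A large positive gap corresponds to the desirable case: edge vectors around the same node agree with each other, while edge vectors of unrelated edges stay distinguishable.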

Model Optimization
For the final objective, we jointly optimize the graph-context loss $L_G$ in Eq. (1) and the infomax loss in Eq. (2) in a combined objective, where a hyper-parameter (denoted $\lambda$ here) controls the importance of the infomax loss. Since it is computationally expensive to minimize this loss directly, we adopt the negative sampling technique. Specifically, we uniformly sample an edge $(v_i, t, v_j)$ from the graph as a positive sample, and then we uniformly sample $K$ negative nodes that have the same type as $v_j$ and do not have relation $t$ to $v_i$. For further improvement on the negative sampling, "hard-samples" negative sampling (Zhu et al., 2021) or adversarial negative sampling (Hu et al., 2019; Sun et al., 2019) can be considered. The effect of negative sampling in homogeneous graphs has also been investigated by Qiu et al. (2018) and Yang et al. (2020). Note that the model can operate in a semi-supervised manner by changing the loss function $L_G$ to a loss function for multi-class classification, such as the cross-entropy loss. However, we found that the semi-supervised loss usually converges significantly faster than the infomax loss. The slower convergence of the infomax loss suggests that the model may underutilize the infomax encoder in such a condition.
To make the model applicable to large graphs, we follow the architecture of GraphSAGE (Hamilton et al., 2017) so that node representations can be generated individually and mini-batch training can be utilized. In particular, to generate the representation of a node $v_i$, we uniformly sample up to $n$ neighbors of $v_i$, where $n$ is a hyper-parameter. Subsequently, for each sampled neighbor, we perform the same sampling process on that neighbor. We repeat the process $L$ times. The aggregation considers only these samples and $v_i$ for generating the representation of $v_i$. In this way, we can limit the memory usage in the aggregation process to $O(n^L T)$, where $T$ is the space complexity of the transformation $h$, compared with $O(|E| T)$ when using the whole graph. In practice, $L$ is usually set to a small number due to the oversmoothing effect. We obtain the nodes' corrupted neighbors by shuffling the neighbors of the nodes in the same mini-batch. Specifically, the corrupted neighbor sets are produced by a permutation function applied over the mini-batch, where B and T denote the arrays of target relations and node indexes of a mini-batch, respectively. The reason is to reduce the computation cost. Since we need to run $l - 1$ encoding layers to obtain the edge vectors $\mathbf{e}^{(l-1)}_{k,q}$ in Eq. (2), shuffling the neighbors inside a mini-batch means that $\mathbf{e}^{(l-1)}_{k,q}$ has already been calculated on the positive side and can be reused on the negative side.
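The sampling and corruption steps can be sketched as follows. This is a simplified illustration, assuming an adjacency dictionary without relation labels and a plain random permutation for the corruption; the function names, the toy graph, and the seed are hypothetical, and a permutation may occasionally map a node back to its own neighbors.

```python
import random

def sample_hops(adj, root, n, num_layers, rng):
    """Sample up to n neighbors per node for num_layers hops from root."""
    frontier, hops = [root], []
    for _ in range(num_layers):
        next_frontier = []
        for v in frontier:
            nbrs = adj.get(v, [])
            next_frontier.extend(rng.sample(nbrs, min(n, len(nbrs))))
        hops.append(next_frontier)
        frontier = next_frontier
    return hops

def corrupt_by_shuffling(batch_neighbors, rng):
    """Reassign the sampled neighbor lists among the nodes of one mini-batch."""
    perm = list(range(len(batch_neighbors)))
    rng.shuffle(perm)
    return [batch_neighbors[p] for p in perm]

rng = random.Random(0)
nodes = range(6)
adj = {v: [u for u in nodes if u != v] for v in nodes}  # toy complete graph

hops = sample_hops(adj, root=0, n=2, num_layers=3, rng=rng)
print([len(h) for h in hops])  # [2, 4, 8], i.e. at most n**l at hop l

batch = [hops_v[0] for hops_v in
         (sample_hops(adj, v, 2, 1, rng) for v in range(4))]
corrupted = corrupt_by_shuffling(batch, rng)
```

Because the corrupted lists are the same objects as the positive lists in a different order, anything already computed for them on the positive side can be reused, which is the cost saving described above.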

Experiments
We conduct extensive experiments to compare HIME with the state-of-the-art models on link prediction, node clustering, and node classification tasks. We also analyze the benefits of the proposed single-level aggregation over bi-level aggregation. Moreover, we provide a thorough investigation into the different aspects of HIME, including the effect of infomax encoding, the scalability, hyper-parameter sensitivity, and its adaptability to other frameworks.

Datasets
Aside from our synthetic toy dataset, we use ten publicly available real-world HIN datasets. We divide the datasets into three groups. The first group consists of datasets with more than one node type and more than one relation: DBLP (Hu et al., 2019), Yelp (Hu et al., 2019), Douban Movie (Shi et al., 2019), Douban Book (Shi et al., 2019), and Amazon-Large (Ni et al., 2019). The second group consists of multiplex networks ($|T| = 1$ and $|R| > 1$): Amazon, YouTube, and Twitter. We obtain the data from Cen et al. (2019). The last group consists of multiplex networks with node labels: ACM and IMDB. We obtain node labels and attributes from Park et al. (2020) for the purpose of node clustering and classification evaluation. Basic statistics of the datasets are summarized in Table 2. Additional details can be found in Sect. A in the appendix.
Note that all GNNs for HIN in the list are bi-level aggregations. Because the core idea of DMGI (Park et al., 2020) and HDGI (Ren et al., 2019) is similar but DMGI has an improved regularization, we select DMGI to represent the global infomax approach for multiplex networks. GATNE in the Link Prediction section refers to GATNE-T, while GATNE in the Node Clustering and Classification section refers to GATNE-I. GATNE* refers to a variant of GATNE obtained by changing its initialization and loss function to Eq. (1). HAN* refers to a variant of HAN obtained by changing the semantic attention to an independent and learnable parameter. For a fair comparison, we fix the embedding dimension $d$ to 128. Unless otherwise specified, all models are trained in an unsupervised manner. Please see Sect. B in the appendix for additional details about the implementation and hyper-parameter settings.

Link prediction
To evaluate the quality of embedding methods in preserving the information of a HIN, we conduct link prediction experiments following Shi et al. (2018b). For each HIN, we split its edges into three sets for training, validation, and testing, containing 85%, 5%, and 10% of the total, respectively. An evaluated model is trained on the training set, while the validation set is used for the stopping criterion and for finding suitable parameters. The task is to predict the edges in the test set using the embeddings learned from the training set. We sample 10 negatives instead of using all possible candidates because of the computation cost. We rank positive edges among both positives and negatives and report the mean reciprocal rank (MRR) by micro- and macro-average. In particular, the micro-average MRR averages all reciprocal ranks without considering relations, whereas the macro-average MRR averages over the per-relation means of the reciprocal ranks. Mathematically,

$$\mathrm{MRR}_{\mathrm{micro}} = \frac{1}{|E_{test}|} \sum_{e \in E_{test}} \frac{1}{\mathrm{rank}(e)}, \qquad \mathrm{MRR}_{\mathrm{macro}} = \frac{1}{|R|} \sum_{t \in R} \frac{1}{c_t} \sum_{e \in E^{t}_{test}} \frac{1}{\mathrm{rank}(e)},$$

where $E_{test}$ denotes the test set, $E^{t}_{test}$ denotes its edges with relation $t$, $\mathrm{rank}(e)$ denotes the rank of the positive edge $e$ among its candidates, and $c_t$ denotes the number of test edges with relation $t$ in the test set. For the macro-average, we only average over relations whose appearances exceed 5%. Table 3 lists the results. We observe that HIME achieves the best performance in all cases. In particular, HIME significantly outperforms the baselines, with p-value < 0.01.
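The difference between the two averages can be sketched as follows. This is a minimal illustration, assuming each test edge is summarized by its relation and the rank of the positive among its candidates, and that the 5% threshold is applied to the share of test edges per relation; the function name, the `min_share` parameter, and the toy ranks are hypothetical.

```python
from collections import defaultdict

def mrr_micro_macro(ranked_edges, min_share=0.05):
    """ranked_edges: list of (relation, rank of the positive edge, 1-based)."""
    rr_by_rel = defaultdict(list)
    for rel, rank in ranked_edges:
        rr_by_rel[rel].append(1.0 / rank)
    total = len(ranked_edges)
    # Micro: one mean over all test edges, ignoring relations.
    micro = sum(sum(v) for v in rr_by_rel.values()) / total
    # Macro: mean of per-relation means, over sufficiently frequent relations.
    kept = [v for v in rr_by_rel.values() if len(v) / total > min_share]
    macro = sum(sum(v) / len(v) for v in kept) / len(kept)
    return micro, macro

ranks = [("U-I", 1), ("U-I", 1), ("U-I", 2), ("U-I", 2), ("I-T", 4)]
micro, macro = mrr_micro_macro(ranks)
print(round(micro, 4), round(macro, 4))  # 0.65 0.5
```

In this toy case the majority relation dominates the micro-average, while the macro-average gives the poorly ranked minority relation the same weight as the majority, which is why both are reported.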

Link prediction on multiplex networks
In multiplex networks, nodes of the same type are connected to each other by multiple relations. We conduct link prediction in multiplex networks to see whether our model can handle this scenario effectively. We obtain the data from Cen et al. (2019) and follow their experimental protocol. We use their source code for evaluation and report the performance. A summary of the results is given in Table 4. We observe that HIME shows superior performance over the other models. The largest gap in performance is on the YouTube dataset, which has the highest number of relations among the three datasets. This implies the effectiveness and distinguishability of the model in dealing with multiple relations. Although DMGI and DGI, which aim to capture global properties, are good for node clustering and classification, they are inferior to models with graph context optimization (first- and second-order proximity in graphs) for the link prediction task.

Node clustering and classification
Node clustering aims to group nodes belonging to the same class, while node classification aims to identify their classes and can be considered either multi-class or multi-label classification (Tsoumakas & Katakis, 2007; Do et al., 2019). We conduct experiments for both cases following Park et al. (2020), with the same data split for the training, validation, and test sets. For methods that ignore node attributes, we concatenate the raw attributes with the learned node representations, following Park et al. (2020). Since our method generates multiple node representations for different edge types, for a fair comparison, we select the node representations of the edge type that yields the best results on the validation set and then evaluate the performance on the test set. In practice, we can combine such different node representations and use fast and highly scalable feature selection methods, including ensemble techniques, to enhance the performance. We report normalized mutual information (NMI) for node clustering, and macro-average (MaF1) and micro-average F1 (MiF1) for node classification. For comparison, we include the bi-level aggregations RGCN, HAN, and HGT, trained in an unsupervised manner with Eq. (1) as the objective function.
The results are summarized in Table 5. HIME demonstrates the best performance, while DMGI is the second best. For node classification, we observe that our unsupervised model HIME even outperforms the semi-supervised model HAN [†]. For node clustering on the ACM dataset, HIME achieves a performance gain as high as 4.9% over DMGI, which is itself significantly superior to the other baselines.

Bi-level vs single-level aggregation
In this section, we explain why bi-level aggregation performs worse than single-level aggregation for relation-aware HIN embedding. First, we look at where single-level aggregation improves the performance over bi-level aggregation. We conduct further experiments for the link prediction task on the Douban Book dataset, which has the largest number of relations, by switching between graph convolutional methods while using the same objective function as defined in Sect. 3.1. For bi-level aggregations, we include well-known methods such as RGCN (Schlichtkrull et al., 2018), HAN (Wang et al., 2019), and HGT (Hu et al., 2020b), and for single-level aggregations, we include commonly used methods such as GraphSAGE (Hamilton et al., 2017) and GAT (Veličković et al., 2017) as well as our proposed HIME. We also include our proposed single-level aggregator without infomax encoding as "HIME (-IM)".
Figure 4 presents the link prediction results for each relation. We can see that the single-level aggregations significantly outperform the bi-level aggregations on minority relations. To explain why bi-level aggregations perform worse, we investigate the importance of each relation to the node representations for a minority relation. Intuitively, an appropriate aggregator should transfer abundant information from majority relations to the minority. One could simply look at the attention scores of the bi-level aggregation. However, this ignores the magnitude of a feature and the transformations along the aggregation path.
To provide a more concrete analysis, we derive an idea from the neural network pruning technique SNIP (Lee et al., 2019a), which can also be considered an application of the attribution method gradient*input (Shrikumar et al., 2017). SNIP uses the derivative of a loss function with respect to an auxiliary variable $c$ representing the connectivity of a parameter $w$. The purpose is to find important connections based on the change of $c$, called the saliency score. To measure the importance of a node $v_i$, the saliency score $s_i$ of the node is computed from the loss, the set of the model's parameters $\Theta$, and the initial feature vector $\mathbf{v}^{(0)}_i$ of node $v_i$ before the aggregation, with abs denoting a function that calculates the absolute value of each element of its input. The higher the saliency score, the higher the significance of a node.
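A gradient*input style score for a single node can be sketched as follows. This is an illustration under assumptions, not the paper's equation: the gradient is approximated by central differences instead of backpropagation, the element-wise absolute products are reduced by a sum, and the toy loss is a hypothetical stand-in for the model's loss as a function of one node's initial feature vector.

```python
def numerical_grad(loss_fn, x, eps=1e-6):
    """Central-difference gradient of loss_fn at x."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        down = list(x)
        up[k] += eps
        down[k] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2.0 * eps))
    return grad

def node_saliency(loss_fn, init_vec):
    """Sum over dimensions of abs(dLoss/dx * x) for one node's initial vector."""
    grad = numerical_grad(loss_fn, init_vec)
    return sum(abs(g * x) for g, x in zip(grad, init_vec))

def toy_loss(x):
    # Hypothetical loss that depends linearly on the node's initial features.
    return 3.0 * x[0] - 2.0 * x[1]

score = node_saliency(toy_loss, [1.0, 2.0])
print(round(score, 6))  # 7.0
```

Multiplying the gradient by the input accounts for both how sensitive the loss is to a feature and how large that feature is, which is what an attention score alone does not capture.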
Figure 5 shows the average saliency score of the nodes of each relation with respect to a target relation's node representations. Note that Fig. 5a-c are from bi-level aggregations, while Fig. 5d-f are from single-level aggregations. RGCN has very high saliency scores on minority relations because it uses the mean across relations in the second level. HAN* uses an attention mechanism in the second level, which can slightly shift the significance toward majority relations and improve the performance. HGT instead deploys an attention mechanism that considers each message from a neighbor. This can alleviate the down-weighting problem and brings its results closer to those of single-level aggregations than the other bi-level methods. However, we observe that for most target relations, HGT has very high saliency scores on the same relation, which suggests that it tends to underutilize the information of the HIN. Moreover, the saliency scores of the majority relations, "U-G" and "U-U", under bi-level aggregations are still much lower than those of minority relations in most cases. This demonstrates that the bi-level aggregation scheme down-weights the information of the majority.

On the other hand, Fig. 5d-f show that single-level aggregations suffer much less from the down-weighting issue. However, the disadvantage of both GraphSAGE and GAT is their lack of awareness of relations in the aggregation, which reduces their effectiveness on HINs. For example, if we directly optimize link prediction for each relation instead of relation-aware embedding, Fig. 6 shows that bi-level aggregation is better than single-level aggregation (excluding HIME) in all cases, in line with similar findings by other researchers. This suggests that there is room for a method that considers relations in the aggregation without suffering from the down-weighting issue. By considering relations in the aggregation alone, HIME (-IM) shows an increase in performance while still performing well for minority relations in Fig. 4, unlike bi-level aggregations. Finally, by incorporating infomax encoding to encourage graph smoothing, HIME outperforms HIME (-IM) in all cases in Fig. 6 and in most cases in Fig. 4, implying that it can utilize more information from the graph structure. We conclude that HIME is effective and powerful for HIN embedding and does not suffer from the down-weighting issue, satisfying both requirements.
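The down-weighting effect can be made concrete with a small worked example. The sketch below compares the effective weight that each individual neighbor receives under a single mean over all neighbors and under a mean within each relation followed by a mean across relations (the RGCN-like second level described above). The relation names and neighbor counts are hypothetical, and learned transformations and attention are deliberately left out.

```python
# Effective per-neighbor weights under two mean-based aggregation schemes.
# Relation names and neighbor counts are made up for illustration.

def single_level_weights(counts):
    """One mean over all neighbors: every neighbor gets weight 1 / total."""
    total = sum(counts.values())
    return {rel: 1.0 / total for rel in counts}


def bi_level_weights(counts):
    """Mean within each relation, then mean across relations."""
    num_rel = len(counts)
    return {rel: 1.0 / (num_rel * n) for rel, n in counts.items()}


counts = {"majority": 98, "minority": 2}  # hypothetical neighbor counts
single = single_level_weights(counts)
bi = bi_level_weights(counts)

# How much more weight a minority neighbor gets than a majority neighbor
# under the bi-level scheme (49x here; 1x under the single-level scheme).
ratio = bi["minority"] / bi["majority"]
```

With 98 majority neighbors and 2 minority neighbors, each majority neighbor contributes about 0.005 under the bi-level mean versus 0.01 under the single-level mean, while each minority neighbor contributes 0.25 versus 0.01. The sketch assumes every relation has at least one neighbor.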

Study of infomax encoding
In this section, we study and analyze the effect of the infomax encoding. For this purpose, we conduct further experiments with three settings: insusceptibility to attack on node features, horizontal improvement, and vertical improvement.

Insusceptibility to attack on node features
As we introduce the infomax encoding, we hypothesize that it augments the graph smoothness. Feng et al. (2019) show that imposing graph smoothness in the prediction stage can increase the performance and the robustness against attacks on node input features in node classification, and Jin & Zhang (2019) obtain a similar result by perturbing the latent representation in GCN. We will demonstrate that the infomax encoding is consistent with this conjecture and improves the robustness.
We choose the node classification task as in Sect. 4.2.3. We perform node feature attacks on either 5% of uniformly sampled nodes or the top 5% of nodes ranked by node degree. The attacks are (1) Noise: injecting Gaussian noise N(0, 0.01); (2) Shuffle: shuffling feature rows among the attacked nodes; and (3) Zero: substituting a constant for the node's features. Then we train HIME and evaluate the performance on the attacked graph. The results are presented in Table 6. We can see that HIME with infomax encoding outperforms its variant without infomax encoding (-IM) regardless of attack type and attacked nodes.
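The three feature attacks can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, 0.01 in N(0, 0.01) is assumed to be the variance (standard deviation 0.1), and the constant used by the Zero attack is assumed to be 0.

```python
import random


def pick_targets(degrees, frac=0.05, by_degree=False, rng=None):
    """Choose attacked node ids: top-degree nodes or a uniform sample."""
    rng = rng or random.Random(0)
    k = max(1, int(len(degrees) * frac))
    if by_degree:
        return sorted(range(len(degrees)), key=lambda i: -degrees[i])[:k]
    return rng.sample(range(len(degrees)), k)


def attack(features, targets, kind, rng=None, std=0.1, constant=0.0):
    """Return a copy of `features` with the targeted rows perturbed."""
    rng = rng or random.Random(0)
    out = [row[:] for row in features]
    if kind == "noise":
        # Assumption: N(0, 0.01) gives the variance, hence std = 0.1.
        for i in targets:
            out[i] = [x + rng.gauss(0.0, std) for x in out[i]]
    elif kind == "shuffle":
        # Permute feature rows among the attacked nodes only.
        perm = list(targets)
        rng.shuffle(perm)
        for dst, src in zip(targets, perm):
            out[dst] = features[src][:]
    elif kind == "zero":
        # Assumption: the substituted constant is 0.
        for i in targets:
            out[i] = [constant] * len(out[i])
    else:
        raise ValueError(f"unknown attack: {kind}")
    return out


features = [[float(i), float(i) + 0.5] for i in range(40)]  # toy features
degrees = list(range(40))                                   # toy degrees
targets = pick_targets(degrees, by_degree=True)             # 5% of 40 -> 2 nodes
zeroed = attack(features, targets, "zero")
```

Only the rows of the attacked nodes change; the model is then trained and evaluated on the perturbed feature matrix.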

Horizontal improvement
The second setting examines the benefit of infomax encoding when going wide (horizontal) in the graph. We set the number of layers L to 1, gradually increase the number of sampled neighbors N from 5 to 100, and run the model with and without the infomax encoding. We plot the performance of the model against the number of sampled neighbors in Fig. 7. We can see that the model achieves higher performance with infomax encoding than without it when N is high enough. This is consistent with our assumption that infomax encoding can improve the usage of each neighbor's information.

Vertical improvement
In the last setting, we investigate whether the infomax encoding can improve the performance when going deep (vertical) in the graph. We conduct additional experiments by varying the number of layers L from 1 to 5, which can reach up to five-hop neighbors, and run the model with and without infomax encoding in all layers. Because of memory limitations, we set the number of sampled neighbors to 50, 20, 8, 5, and 3 for L = 1 to 5, respectively.
The results are listed in Table 7. As can be seen, the model with infomax encoding outperforms the one without it in all cases. The performance gain shrinks as the number of layers increases. The reason is that with more layers, (1) the number of sampled neighbors becomes lower, and (2) the smoothing effect of the GNN becomes stronger, leading to more homogeneity even when infomax encoding is not used.
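As a rough illustration of why the neighbor budget has to shrink with depth, the following sketch computes an upper bound on the number of nodes touched per target node for the five settings listed above. It assumes, hypothetically, that the same number of neighbors N is sampled at every layer and that the sampled neighborhoods do not overlap; the function name is illustrative.

```python
def max_sampled_nodes(fanout, layers):
    """Upper bound on nodes touched per target: 1 + N + N^2 + ... + N^L."""
    return sum(fanout ** hop for hop in range(layers + 1))


settings = {1: 50, 2: 20, 3: 8, 4: 5, 5: 3}  # L -> N, as listed in the text
bounds = {L: max_sampled_nodes(N, L) for L, N in settings.items()}
# bounds == {1: 51, 2: 421, 3: 585, 4: 781, 5: 364}
```

Under these assumptions, the chosen settings keep the per-target budget below roughly 800 nodes at every depth, whereas keeping N = 50 for L = 5 would allow more than 300 million.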

Oversmoothing issue
We investigate whether HIME suffers from the oversmoothing issue, which is common in GNNs. In addition, we aim to provide empirical evidence to support our motivation for using GRUs. Therefore, we introduce two variants of the single-level aggregator by replacing the GRU update with GCN-style and RGCN-style updates. These variants use the rectified linear unit (ReLU), a learnable transformation matrix, and a balancing factor between the information of a node and that of its neighbors, which equals $1/(|\mathcal{N}_i|+1)$ for the GCN style and $1/(|\mathcal{R}|+1)$ for the RGCN style. We conduct additional experiments for the GCN and RGCN styles by varying the number of layers L from 2 to 5 with the same setting as in Sect. 4.4.3. The results are listed in Table 8. We observe that all variants achieve comparable results when L = 2. However, as the number of layers increases, the performance gap becomes larger on both datasets, and the GRU variant retains its performance best among them. The results demonstrate the possibility of using a GRU to alleviate the oversmoothing issue.

To further investigate the difference among them, we use the Mean Average Distance (MAD) between node representations (Chen et al., 2020) to measure the smoothness of the representations. The distance is defined as one minus the cosine similarity between the representations of a pair of nodes. A lower MAD value indicates smoother node representations in a graph, and a value of zero means that all the node representations have become indistinguishable. The results are shown in Fig. 8. We observe significant drops in the MAD values of the GCN and RGCN variants on the Yelp dataset, indicating the possibility of oversmoothing. In contrast, the MAD values of the GRU variant decrease only slightly as the number of layers increases on both datasets. This supports the claim that HIME with GRU (the default) does not suffer from the oversmoothing issue.
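The smoothness measure can be sketched in a few lines. This simplified version averages the cosine distance uniformly over all distinct node pairs, which may differ from the exact averaging used by Chen et al. (2020); the function names and toy vectors are illustrative, and the representations are assumed to be nonzero.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_average_distance(reps):
    """Average cosine distance over all distinct node pairs (simplified MAD)."""
    n = len(reps)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine_distance(reps[i], reps[j]) for i, j in pairs) / len(pairs)


distinct = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # well-separated directions
collapsed = [[2.0, 2.0], [1.0, 1.0], [3.0, 3.0]]  # all point the same way
mad_distinct = mean_average_distance(distinct)
mad_collapsed = mean_average_distance(collapsed)
```

Representations that all point in the same direction give a value of (numerically) zero regardless of their norms, which is the oversmoothed extreme, while well-separated representations give a clearly positive value.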

Training time, scalability, and convergence analysis
We analyze the training time of HIME and compare it with that of bi-level aggregation models. We conduct experiments by training models on Douban Movie while varying the number of training edges. For a clearer picture, we include the time for variants of HIME with different transformations h and with/without infomax encoding (IM). We plot the results in Fig. 9. As can be seen, the characteristics of the training time vary depending on the model. For HIME, we find that its training time grows linearly with the number of edges, but the slope is smaller than those of GATNE, HAN, and HGT because fewer training steps are needed until convergence. The faster convergence is probably due to the simplicity of single-level aggregation compared with bi-level aggregations that employ attention mechanisms. On the other hand, RGCN shows a training time similar to that of HIME because RGCN employs mean pooling instead of attention.

Furthermore, to investigate the scalability of HIME, we run HIME on the Amazon-Large dataset consisting of 12.8 million edges. We compare the link prediction performance of HIME to its variant without the infomax encoding. Table 9 shows the performance and training time on a single NVIDIA Tesla V100 GPU. HIME with the infomax encoding performs better at a small additional time cost. We plot the convergence curve of HIME in Fig. 10. We observe that the training loss converges steadily and that the performance on the validation set improves as the loss decreases.

Ablation study
We provide an ablation study to clarify the methodological contributions of our work over RGCN. Our contributions so far are as follows.
(i) We proposed single-level aggregation (Sect. 3.1), which uses mean pooling over all types of neighbors instead of averaging over each type of neighbor as in RGCN.
(ii) We introduced the transformation h (Sect. 3.1), which should be carefully selected and fine-tuned, unlike RGCN, which uses only a linear transformation.
(iii) We proposed infomax encoding (Sect. 3.4), which can reduce the heterogeneity and promote the homogeneity between neighbors in the graph.
To demonstrate the effect of these three contributions, we perform an ablation study of HIME by considering three variants, each of which adds one contribution on top of the previous one. Note that for (ii), we use the hyperplane transformation to distinguish it from the linear transformation in RGCN.
The results are listed in Table 10. We observe that the effect of each contribution differs significantly and depends on the dataset. Specifically, the first contribution shows the highest performance gain among the contributions on the Douban Movie and Douban Book datasets, where the average degree of the graph is high and the edges are heavily dominated by the majority relations (please see Table 12 in Appendix A for the statistics). Such a situation causes a severe down-weighting issue in bi-level aggregations, which explains the performance gain from the first contribution. Conversely, it performs significantly worse than RGCN on DBLP, where the average degree is much lower than in the Douban Movie and Douban Book datasets. This empirically supports the use of RGCN in the setting for which it was originally proposed, namely knowledge graphs, which usually have a low average degree. The second contribution yields significant improvements in all cases. Recall that we have thoroughly investigated the effect of the third contribution in Sect. 4.4.

Hyper-parameter sensitivity
We conduct a parameter sensitivity analysis by adjusting two important hyper-parameters, the embedding dimension d and the weight of the infomax loss, with L = 1. We plot the model performance against them in Fig. 11a, b. As d increases, the performance rises until it plateaus when d > 100; beyond that point, the performance is only slightly affected by d. On the other hand, the performance improves as the infomax loss weight increases from 0.001 to 0.1 and then declines. HIME achieves the best results at a weight of around 0.1, where the infomax loss positively contributes to the performance.

Adaptability of HIME to other frameworks
We investigate whether HIME can be used in other frameworks with a similar setting. We select GPT-GNN (Hu et al., 2020a), a framework for pretraining graph neural networks that is applicable to HINs. GPT-GNN aims to generate or reconstruct an input graph, which is similar to the objective of HIME. The main difference between GPT-GNN and HIME is that GPT-GNN operates on sampled subgraphs, while HIME samples an observed edge and then generates the relevant node representations in the training stage. We conduct experiments based on the GPT-GNN framework by comparing an augmented HIME (HIME*) to HGT. HIME* uses relative temporal encoding, as in HGT, in the relation-specific transformation to generate edge vectors. However, to keep the concept of single-level aggregation, HIME* does not use any attention. We use the provided hyper-parameter setting without tuning. We use the Open Academic Graph (OAG) data on computer science (CS) provided by the GPT-GNN authors. We follow the original setting for time transfer and then evaluate the models on three downstream tasks: prediction of Paper-Field (PF), Paper-Venue (PV), and Author Disambiguation (AD). Please see Sect. 4 of the GPT-GNN paper for more details.
Table 11 shows the performance of both models on the downstream tasks. We observe that HIME* outperforms HGT both with and without pretraining in most cases. This demonstrates that the concept of HIME, a heterogeneous single-level aggregator with infomax encoding, is applicable to other frameworks.

Conclusion
In this work, we proposed a single-level aggregation scheme and the application of infomax to learn node embeddings for heterogeneous information networks. The single-level aggregation scheme is not only simpler than the bi-level scheme adopted by state-of-the-art methods but also achieves higher performance across many benchmark tests. The proposed infomax encoding helps bridge heterogeneous embedding and homogeneous embedding by encouraging graph smoothness, and it remains scalable. We conducted extensive experiments to verify and compare the performance of our model with the state-of-the-art methods.
Our results with single-level aggregation support the claim that the bi-level aggregation scheme down-weights some popular node types and edge types (such as users and user interactions) by design. In light of this, it is beneficial to use our single-level aggregation scheme as a benchmark method in future studies.
For future work, we aim to investigate ways to combine existing homogeneous GNNs and HIN embedding frameworks efficiently, so that a variety of homogeneous GNNs can be adapted to the heterogeneous domain.

Appendix A: Additional statistics of datasets
Table 12 provides additional details, including node types and relations. We choose to conduct experiments on ACM and IMDB with metapath pre-processing because of the extreme sparsity of the original graphs. For additional details about the Amazon, YouTube, and Twitter datasets, please see the appendix of the GATNE paper (Cen et al., 2019).
We set the number of walks per node to 10, the walk length to 100, and the window size to 5. For knowledge graph embedding methods including TransH (Wang et al., 2014), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), and RotatE (Sun et al., 2019), we use the source code provided by Sun et al. (2019). We use the binary cross-entropy loss for training DistMult and ComplEx. Note that ComplEx and RotatE embed the node representations in a complex space $\mathbb{C}^d$, whereas the other methods embed them in a real space $\mathbb{R}^d$. For HAN and HGT, we set the number of attention heads to 8 and the output size of each head to 16. For a fair comparison between single-level and bi-level aggregations, we use the same hyper-parameter setting for RGCN, HAN, HGT, and HIME without fine-tuning. For datasets that have no node attributes, all initial feature vectors $x_*^{(0)}$ are randomly initialized and treated as learnable parameters.

Fig. 1 a A snapshot of the Amazon data where the node of interest (the laptop) is circled; b Example of our proposed single-level aggregation scheme; c Example of the bi-level aggregation scheme, where attention is required. Our model introduces relation projections and infomax learning, which replace the attention mechanism in state-of-the-art approaches

Fig. 4 Link prediction performance of models for each relation in the Douban Book dataset when optimizing relation-aware embedding. The number in round brackets indicates the percentage of the relation

Fig. 5 Average saliency score of nodes connected to output nodes via each relation (x-axis) for generating the node representations for a target relation (y-axis) on Douban Book dataset

Fig. 6 Link prediction performance of models for each relation in the Douban Book dataset when optimizing the link prediction task for each relation. The number in round brackets indicates the percentage of the relation

Fig. 8 MAD values of HIME with different aggregation styles and numbers of layers. A lower MAD value indicates smoother node representations in the graph. The GRU variant is the least prone to the oversmoothing issue

Fig. 10 Convergence curve for HIME on the Amazon-Large dataset. The plotted performances are on the validation set

Table 2 Statistics of datasets

Table 3 Micro- and macro-average MRR results of each model in the link prediction task in HINs. Bold text indicates the best result in each case. Underlined text indicates the best among baselines. ⋆ indicates a significant improvement of HIME over baselines, with p-value < 0.01

Table 6 Node classification performance of HIME on IMDB graph with node feature attacks

Fig. 7 Performance of HIME with/without infomax w.r.t. the number of sampled neighbors

Table 7 Performance (Micro-avg. MRR) of the model with and without infomax encoding when varying the number of layers

Table 9 Performance (MRR) and time analysis in the link prediction task on the Amazon-Large dataset. "t/step" and "T" denote time per step and total training time, respectively

Table 10 Ablation study of HIME in comparison to RGCN in the link prediction task. Bold values indicate the best result for each dataset and metric

Table 11 Performance of HGT and HIME* on multiple tasks with and without pretraining. Bold values indicate the best result for each downstream task and metric