Memory-Enhanced Transformer for Representation Learning on Temporal Heterogeneous Graphs

Temporal heterogeneous graphs can model many complex real-world systems, such as social networks and e-commerce applications, which are naturally time-varying and heterogeneous. As most existing graph representation learning methods cannot efficiently handle both of these characteristics, we propose a Transformer-like representation learning model, named THAN, that simultaneously learns low-dimensional node embeddings preserving the topological structure, heterogeneous semantics, and dynamic patterns of temporal heterogeneous graphs. Specifically, THAN first samples heterogeneous neighbors under temporal constraints and projects node features into the same vector space, then encodes time information and aggregates neighborhood influence with different weights via type-aware self-attention. To capture long-term dependencies and evolutionary patterns, we design an optional memory module that stores and evolves dynamic node representations. Experiments on three real-world datasets demonstrate that THAN outperforms state-of-the-art methods on the temporal link prediction task.


Introduction
Graph representation learning, as an important task in machine learning, has significant practical value in areas such as social networks and recommendation systems. Existing graph representation learning methods usually take static graphs as input, obtaining low-dimensional embeddings by encoding local non-Euclidean structures, and have achieved excellent performance in downstream tasks such as link prediction [1][2][3], node classification [4,5], and graph classification [6,7].
However, most graphs in the real world are naturally heterogeneous and dynamic, and cannot be accurately represented by static homogeneous graphs. Several studies incorporate heterogeneous data models into a unified graph model [8], promoting research on graph data. Take the user-item interaction network in e-commerce scenarios, illustrated in Fig. 1a: there are two types of nodes (i.e., user and item) and three types of interactions (i.e., favorite, browse, and buy). Additionally, each interaction is associated with a continuous timestamp indicating when it occurred. In this paper, we model such interaction sequences as a temporal heterogeneous graph (THG). In the case of the user-item interaction network shown in Fig. 1a, THG representation learning faces the following challenges compared to static homogeneous graph representation learning:
• (C1) How to model the heterogeneity? The nodes and edges in a THG are of various types and carry rich semantics, so sufficient heterogeneous information cannot be obtained just by encoding the local graph structure.
• (C2) How to model the continuous dynamics? The edges in a THG are time-informed and time-dependent, i.e., each event occurs with a timestamp, and a current event may affect the occurrence of future events. For instance, there might be a causal relationship between user A searching for headphones on June 18 and purchasing headphones on November 11. Therefore, both efficient methods of converting temporal information into dynamic features and temporal constraints are needed to avoid violating the temporal causality between interactions.
• (C3) How to deal with new nodes? The dynamics of the THG imply that new nodes will emerge in the future (e.g., users D and E are two new nodes that appeared on November 11 compared to June 18). In other words, these nodes are not present during training and many practical applications will require their embeddings to be generated in a timely manner. Therefore, it is necessary to construct an inductive modeling approach that generalizes the optimized representation to the new temporal subgraphs.
As for heterogeneity, earlier methods [9,10] preserve heterogeneous information by designing semantic meta-paths to generate heterogeneous sequences, and recent studies [2,[11][12][13]] aggregate information from heterogeneous neighborhoods by extending the message-passing process of graph neural networks (GNNs). Concerning dynamics, a common approach is to split temporal graphs into several static snapshots (i.e., discrete-time dynamic graphs, DTDGs [14]) and use RNNs or attention to capture the evolutionary patterns between snapshots [15][16][17][18]. Although these methods can learn the dynamics of a THG to some extent, the temporal information within each snapshot is usually ignored, and the scale of the snapshots needs to be predetermined. Recently, researchers have proposed continuous-time dynamic graph (CTDG [14]) approaches [19][20][21][22][23][24] that capture dynamics by passing information between different interactions or by using continuous-time functions to generate temporal embeddings. Regarding new nodes, inductive graph representation learning methods [5,22,23,25] recognize the structural features of a node's neighborhood by learning trainable aggregation functions, so that node embeddings in new subgraphs can be generated rapidly. Plenty of studies have attempted to solve the above challenges; nevertheless, few approaches address them all at the same time.

Fig. 1 Temporal heterogeneous graph. Different colored lines represent different interactions, where the blue line denotes favorite (i.e., the heart icon), the green line denotes browse (i.e., the magnifier icon), and the orange line denotes buy (i.e., the wallet icon)

In this paper, we propose a novel Temporal Heterogeneous Graph Attention Network (THAN), which is a continuous-time THG representation learning method with a Transformer-like attention architecture. To handle C1, we design a time-aware heterogeneous graph encoder to aggregate information from different types of neighbors.
To handle C2, THAN samples temporally constrained neighbors and learns time-aware representation from historical heterogeneous events for a given node at any time point. It also encodes time information with a time encoder and incorporates them into the message propagation process. To handle C3, THAN can be thought of as a local aggregation operator based on neighbor sampling that recognizes the structural properties of a node's neighborhood and does not introduce global priori information.
THAN generates dynamic embeddings of nodes from their most recent neighbors. However, long-term dependencies and evolutionary patterns are not considered. Moreover, high-order information can be captured by stacking multiple THAN layers, but the cost is huge. To address these problems, we design an optional memory module to store and evolve node representations. This module dynamically updates node states as events occur and provides indirect access to distant neighbors by adding node memories to the raw inputs of THAN. The main contributions of our work are summarized as follows:
• We propose an inductive continuous-time THG representation learning method, which can capture both heterogeneous information and dynamic features.
• We introduce the dynamic transfer matrix and self-attention mechanism to implement the information aggregation of heterogeneous neighbors.
• We devise an optional memory module to enhance the representational ability of THAN by storing and updating dynamic node states.
• We conduct experiments on three public datasets, and the results demonstrate the superior performance of THAN over state-of-the-art baselines on the task of temporal link prediction.
The rest of the paper is organized as follows. We review related work in Sect. 2, and formulate the problem of temporal heterogeneous graph representation learning in Sect. 3. In Sect. 4, we discuss the critical techniques of THAN. We report a systematic empirical evaluation in Sect. 5, and conclude the paper in Sect. 6.

Related Work
Our work is related to representation learning on static graphs, temporal graphs (i.e., dynamic graphs), and the self-attention mechanism on graphs.
Representation learning on static graphs Graph representation learning produces low-dimensional embeddings by modeling the topology and node attribute information. Early methods [9,10,26,27] generate sequences of nodes by random walks among neighbors and then learn node co-occurrences to obtain representations. Luo et al. [28] define a ripple distance over ripple vectors to optimize the walking procedure. In order to integrate rich node attribute features while learning network structure information, GNN-based approaches [2,4,5,[11][12][13],25] update node embeddings by aggregating neighborhood influence and propagating information across a multilayer network to capture the high-order patterns of the graph.
Focusing on heterogeneity, metapath2vec [9] and HIN2Vec [10] preserve heterogeneous information by designing semantic meta-paths, while heterogeneous GNNs [2,11,12] attempt to extend the message-passing procedure to handle different categories of information. Specifically, RGCN [2] introduces relation-specific transformations to encode features, HAN [12] designs hierarchical attention to describe node-level and semantic-level structures, and HGT [11] uses meta-relation-based mutual attention to operate on heterogeneous graphs and learn implicit meta-paths. However, these methods cannot deal with temporal dynamics.
Representation learning on temporal graphs According to how temporal graphs are constructed, temporal graph representation learning methods can be divided into two categories: discrete-time methods, which describe the temporal graph as an ordered list of graph snapshots; continuous-time methods, which treat the temporal graph as an event stream with timestamps.
For the former, EvolveGCN [16] uses GCN to encode static graph structure and evolves the parameters of GCN by RNN. DySAT [17] uses structural attention to aggregate information from different neighbors in each snapshot and uses temporal attention to capture evolution over multiple snapshots. DyHATR [18] adopts hierarchical attention to learn heterogeneous information and applies RNNs with temporal attention to capture dependencies among snapshots. HTGNN [15] jointly models heterogeneous spatial and temporal dependencies through intra-relational, inter-relational, and cross-temporal aggregation. ROLAND [29] proposes a framework to extend static GNNs to dynamic graphs. Although discrete-time methods succeed in learning the dynamic patterns of temporal graphs, they ignore time information within the same snapshot, which leads to weakened connections between graph snapshots.
Recent studies [19][20][21][22][23]30] have shown the superior performance of continuous-time methods in dealing with temporal graphs. JODIE [21] uses RNNs to propagate information in interactions and update node representations smoothly at different timesteps. TGAT [23] is designed as a GAT-like neural network, which propagates node information by sampling and aggregating historical neighbors, and learns high-order patterns by stacking multiple layers. TGN [24] proposes a general framework for encoding temporal graphs and captures long-term dependencies by preserving node states. CAW-N [22] proposes Causal Anonymous Walks (CAWs) to inductively represent a temporal graph and uses RNN to encode the walk sequences. These methods make full use of temporal information and model the dynamics of the graph without taking into account the heterogeneity. THINE [19] and HPGE [20] combine heterogeneous attention and Hawkes process to model graph heterogeneity and dynamics but do not consider the edge attributes.
Self-attention mechanism The Transformer [31], proposed by Vaswani et al. for machine translation, has achieved great success in NLP and CV tasks and has recently been applied to graph representation learning. For example, GTN [32] automatically generates useful meta-paths and learns new graph structures. Graphormer [33] generalizes positional encoding to the graph domain and uses scaled dot-product attention for message passing. The Transformer relies on the self-attention mechanism to learn contextual information for sequences. A scaled dot-product attention layer can be defined as:

Attn(Q, K, V) = softmax(Q K^T / √d) V,

where Q denotes the 'queries', K the 'keys', and V the 'values'. They are the projections of the input Z onto the matrices W_Q, W_K, and W_V, where Z contains the node embeddings and their positional embeddings.
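As a concrete reference, the layer above can be sketched in a few lines of pure Python (unbatched, lists instead of tensors; real implementations run this as framework matrix operations):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Attn(Q, K, V) = softmax(Q K^T / sqrt(d)) V.
    Q: n x d, K: m x d, V: m x d, all as lists of lists."""
    d = len(K[0])
    out = []
    for q in Q:
        # raw scores q . k / sqrt(d) against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # numerically stable softmax over the scores
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # weighted sum of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With identical keys the weights become uniform, so the output is simply the mean of the value rows.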

Preliminaries
In this section, we introduce the definition of temporal heterogeneous graph and the problem of temporal heterogeneous graph representation learning.

Definition 1 (Temporal Heterogeneous Graph) A temporal heterogeneous graph is G = (V, E, T), where V denotes the set of nodes with a node type mapping function φ ∶ V → A, E denotes the set of temporal events (i.e., edges) with an event type mapping function ψ ∶ E → R, and T denotes the set of timestamps. A and R are the node type set and the event type set, respectively, and |A| + |R| > 2.
Note that an event e = (u, v, t, f) means that there is an edge from u to v at time t, where f denotes the edge feature and r = ψ(e) denotes the event type. As shown in Fig. 1b, the temporal heterogeneous graph of user-item interactions consists of 13 nodes, 17 events (a smaller subscript of t_i indicates that the event occurred earlier), two types of nodes, and three types of events. Specifically, r_1 denotes favorite, r_2 denotes browse, and r_3 denotes buy.
For any node pair (u, v), a temporal causal path is a sequence of events in which u is the source node of the start event and v is the target node of the terminal event. The temporal shortest path distance d_t(u, v) is then defined as the minimum length of a temporal causal path from u to v with all events on the path occurring no later than t. Denote by V_t the set of nodes that appear up to time t, and for each node v ∈ V_t, define its k-hop temporal neighbors as N_t^k(v) = {u ∈ V_t | d_t(u, v) ≤ k}. The corresponding temporal subgraph G_t^k(v) is a subgraph of the temporal heterogeneous graph G induced by N_t^k(v): it contains the source node v and its neighbors N_t^k(v), the events between these nodes, and the timestamps of these temporal events. The final representation of node v is generated from G_t^k(v). Note that we write N_t(v) and G_t(v) for N_t^1(v) and G_t^1(v) in this paper, respectively.
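The k-hop temporal neighborhood can be illustrated with a small sketch; for simplicity it treats edges as undirected and ignores event types and features, which glosses over the causal-path direction in the formal definition:

```python
from collections import deque

def k_hop_temporal_neighbors(events, v, t, k):
    """N_t^k(v): nodes reachable from v within k hops, using only
    events (u, w, t_e) with t_e <= t. Simplified: edges are treated
    as undirected and event types are ignored."""
    # build adjacency from events that occurred no later than t
    adj = {}
    for u, w, t_e in events:
        if t_e <= t:
            adj.setdefault(u, set()).add(w)
            adj.setdefault(w, set()).add(u)
    # breadth-first search up to depth k
    seen, frontier, result = {v}, deque([(v, 0)]), set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                result.add(nb)
                frontier.append((nb, depth + 1))
    return result
```

Events with timestamps after t are excluded, which is exactly the temporal constraint that keeps the neighborhood causally valid.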

Definition 2 (Temporal Heterogeneous Graph Representation Learning) Given a temporal heterogeneous graph G and node features X, the goal is to learn a mapping function F ∶ (G, X) → ℝ^{|V|×d}, where |V| is the number of nodes and d is the dimension of the embeddings, d ≪ |V|.
This mapping function maps nodes to low-dimensional vector space while preserving temporal, structural, and semantic information. For the sake of clarity, Table 1 summarizes the main notations used in this paper.

The Proposed Model
In this section, we present a Transformer-like graph attention architecture named THAN. It uses mapping matrices to project node embeddings into the same vector space, then passes neighborhood information via dot-product attention corresponding to different event types. Similar to GAT [5], THAN is designed as a local aggregation operator that captures high-order information by stacking multiple THAN layers. Figure 2 shows the architecture of the l-th THAN layer, which has three components: temporal heterogeneous neighbor sampling, dynamic embedding mapping, and a temporal heterogeneous graph attention layer. To capture long-term dependencies and evolutionary patterns, we design an optional memory module that provides indirect access to distant neighbors and dynamically updates the states of the nodes. After graph encoding, we use a heterogeneous graph decoder for the temporal link prediction task, which receives the node representations from THAN as inputs.

Temporal Heterogeneous Neighbor Sampling
To improve the inductive and generalization performance of the model, THAN does not take all temporal neighbors as input but samples a fixed number of them. Given a node v_0 and time t, we sample N neighbors from its 1-hop temporal neighbors N_t(v_0), denoted as {v_1, ..., v_N}. We discuss two neighbor sampling strategies: uniform random sampling, where all temporal neighbors are selected with equal probability; and top-N recent sampling, where the time differences to the source node are computed and sorted in ascending order, and the top N neighbors are selected. Intuitively, recent interactions reflect the node's current state better than distant interactions and have a greater influence on future events; distant interactions, on the contrary, may introduce noise. Therefore, we use the top-N recent sampling strategy.
In a temporal heterogeneous graph, the number of events of different types varies greatly, which can easily lead to an unbalanced type distribution among the sampled neighbors. To avoid such sampling bias as far as possible, THAN limits the number of samples of each event type to at most M. If the number of event types related to the source node is ρ (ρ ≤ |R|), the total number of sampled neighbors N satisfies N ≤ ρ ⋅ M.
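Putting the two constraints together, namely top-N recent sampling with a per-type cap M, gives a minimal sketch (the tuple layout `(node, event_type, timestamp)` is an illustrative assumption, not the paper's data format):

```python
def sample_neighbors(temporal_neighbors, t, M):
    """Top-N recent sampling with a per-event-type cap of M.
    temporal_neighbors: list of (node, event_type, t_event), t_event < t."""
    by_type = {}
    # visit neighbors from most recent to most distant in time
    for node, etype, t_e in sorted(temporal_neighbors, key=lambda x: t - x[2]):
        bucket = by_type.setdefault(etype, [])
        if len(bucket) < M:  # keep at most M neighbors per event type
            bucket.append((node, etype, t_e))
    return [item for bucket in by_type.values() for item in bucket]
```

With ρ event types present, the returned list has at most ρ ⋅ M entries, matching the bound above.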

Dynamic Embedding Mapping
Fig. 2 The architecture of the l-th THAN layer for node u_0 at time t

TGAT [23] assumes that different nodes follow the same feature distribution and therefore share the model parameters, which does not hold in heterogeneous graphs. Furthermore, in the real world, there may be multiple types of edges between two nodes, and different edge types may correspond to different vector distributions. So we must consider the diversity not only of nodes but also of edges when propagating information.
A straightforward solution is mapping features in different distributions to the same semantic space by transfer matrices. However, as the number of types increases, more parameters will be introduced into the model. To reduce the number of training parameters as well as to avoid large-scale matrix multiplication calculations, inspired by TransD [34], THAN projects node features from the node-type space to the event-type space by dynamically computing the transfer matrix with two type-related vectors.

Example 3
Suppose that there are two types of nodes and three types of events in a heterogeneous graph, and that the node-type and event-type projection vectors each have four dimensions. We then have to define a transfer matrix for each mapping from a node type to an event type. If we parameterize the transfer matrices directly, the number of parameters is 96 (i.e., 2 × 3 × 4 × 4). However, if we instead use projection vectors, the number of parameters is only 20 (i.e., 2 × 4 + 3 × 4).
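The counts in Example 3 can be verified directly:

```python
def transfer_matrix_params(num_node_types, num_event_types, d_n, d_e, use_vectors):
    """Parameter count for node-type -> event-type transfer matrices.
    Direct matrices: one d_e x d_n matrix per (node type, event type) pair.
    Projection vectors: one d_n vector per node type plus one d_e vector
    per event type, from which each matrix is computed dynamically."""
    if use_vectors:
        return num_node_types * d_n + num_event_types * d_e
    return num_node_types * num_event_types * d_n * d_e
```

The vector parameterization grows additively in the number of types instead of multiplicatively, which is the point of the dynamic mapping.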
Given an event e = (u, v, t) with its meta relation ⟨φ(u), ψ(e), φ(v)⟩ [11], we define the dynamic mapping matrices as:

M_{φ(u)}^{ψ(e)} = p_{ψ(e)} p_{φ(u)}^T + I,  M_{φ(v)}^{ψ(e)} = p_{ψ(e)} p_{φ(v)}^T + I,

where p_{ψ(e)} and p_{φ(⋅)} denote the projection vectors of event types and node types, respectively, both of which are trainable. The projected node embeddings are:

z_u(t) = M_{φ(u)}^{ψ(e)} x_u(t),  z_v(t) = M_{φ(v)}^{ψ(e)} x_v(t),

where x_u(t) and x_v(t) are the input embeddings of nodes u and v, respectively.
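A sketch of the dynamic mapping for a single node embedding; it exploits the outer-product structure to avoid ever materializing the matrix (the function name and list-based vectors are illustrative, and equal node/event dimensions are assumed for simplicity):

```python
def dynamic_transfer(x, p_node, p_event):
    """TransD-style dynamic mapping z = (p_event p_node^T + I) x,
    computed as p_event * (p_node . x) + x without building the matrix.
    Assumes p_node, p_event and x share the same dimension."""
    dot = sum(pn * xi for pn, xi in zip(p_node, x))  # p_node . x
    return [pe * dot + xi for pe, xi in zip(p_event, x)]
```

Because only the two type vectors are stored, the per-mapping cost is O(d) memory and O(d) time rather than O(d²).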

Temporal Heterogeneous Graph Attention Layer
Different events in a temporal heterogeneous graph may have different features; for example, in a question answering network, an answer interaction can be regarded as an event whose features are determined by its content. To enable event features to be propagated when aggregating information, THAN adds them to the node embeddings, followed by a normalization layer (e.g., LayerNorm [35]). The event features are resized to the same dimension as the node embeddings, and the output is:

ẑ_i(t_i) = LayerNorm(z_i(t_i) + f_{0,i}(t_i)),

where i indicates the i-th neighbor and f_{0,i}(t_i) denotes the feature of the event between nodes v_0 and v_i at time t_i. Here, we set f_{0,0}(t) to the zero vector.

The Transformer [31] uses positional encoding to model relative position relationships, thus solving the problem that the attention mechanism cannot capture sequential relationships between entities. In temporal graphs, a functional time encoder [36,37] is usually used to map the time interval between nodes into a d_T-dimensional vector in place of positional encoding. THAN uses a Bochner-type functional time encoding [23,37]:

Φ_{d_T}(Δt) = √(1/d_T) [cos(ω_1 Δt), sin(ω_1 Δt), ..., cos(ω_{d_T/2} Δt), sin(ω_{d_T/2} Δt)],

where the {ω_i} are learnable parameters. We merge the time embeddings with the node representations to obtain the node-temporal feature matrices:

Z_s = [z_0^{e_i}(t) ‖ Φ_{d_T}(0)],  Z_n = [ẑ_i(t_i) ‖ Φ_{d_T}(t − t_i)]_{i=1..N},

where z_0^{e_i} and ẑ_i denote the mapped embeddings of the source node v_0 and its neighbor v_i corresponding to event e_i, respectively, and ‖ denotes the 'concatenate' operation. Z_s and Z_n are forwarded to three different linear projections to obtain the 'query', 'key', and 'value':

Q = Z_s W_Q^{ψ(e_i)},  K = Z_n W_K^{ψ(e_i)},  V = Z_n W_V^{ψ(e_i)},

where e_i denotes the event between v_0 and v_i, and W_Q, W_K, W_V ∈ ℝ^{(d+d_T)×d} denote the projection matrices. Due to the edge heterogeneity, the projection matrices cannot be shared directly; thus we use type-specific matrices to distinguish different events while capturing their semantics. The attention weight α_i is given by:

α_i = softmax_i(Q K^T / √(d + d_T)),

and it reveals how v_i attends to the feature of v_0 through event e_i.
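One concrete reading of the time encoder, assuming cosine/sine pairs per learnable frequency as in common TGAT implementations (the exact basis used by the paper may differ):

```python
import math

def time_encode(delta_t, omegas):
    """Bochner-type functional time encoding: maps a time interval to a
    d_T-dimensional vector of cos/sin features with frequencies `omegas`
    (a plain list standing in for trainable parameters)."""
    d_T = 2 * len(omegas)
    scale = math.sqrt(1.0 / d_T)
    out = []
    for w in omegas:
        out.append(scale * math.cos(w * delta_t))
        out.append(scale * math.sin(w * delta_t))
    return out
```

A zero interval maps to a fixed vector (all cosine slots at the scale factor, all sine slots at zero), so the source node's own Φ(0) entry is constant.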
In addition, not all types of events contribute equally to the source node, so we set a learnable tensor μ ∈ ℝ^{|A|×|R|} to adaptively adjust the scale of attention paid to different-typed events.
The self-attention aggregates the features of the temporal neighbors and obtains the hidden representation h̃(t) = Σ_i α_i V_i for node v_0, which captures both node features and topological information. The next step is to map this representation back to the type-specific distribution of node v_0 so that it can be fused with the node's own features. We use a linear projection named Q-Linear to do this, and the neighborhood representation is:

z̃_0(t) = Q-Linear(h̃(t)).

To combine the neighborhood representation with the source node feature, we concatenate them and pass the result to a feed-forward neural network, as in TGAT [23]:

h_0^l(t) = FFN(z̃_0(t) ‖ x_0(t)).

Multi-head attention can effectively improve model performance and stability, and THAN can easily be extended to a multi-head setup. Assuming the self-attention outputs come from P different heads, i.e., s^i ≡ Attn^i(Q, K, V), i = 1, ..., P, we first concatenate these neighborhood representations with the source node's feature and then carry out the same procedure as above:

h_0^l(t) = FFN(s^1 ‖ ... ‖ s^P ‖ x_0(t)),

where h_0^l(t) ∈ ℝ^d is the final output representation for node v_0 at time t, and it can be used for the link prediction task within an encoder-decoder framework.

Memory Module
THAN stacks multiple layers to capture high-order patterns. However, as the number of layers increases, the cost in memory resources and training time grows exponentially. Moreover, constrained by the message-passing architecture, THAN cannot capture long-term features. To break this limitation, we devise a memory module (similar to TGN [24]) to save the historical states (i.e., memories) of the nodes. The states are dynamically updated as events occur, thereby introducing long-term dependencies and indirectly accessing information from distant hops. Our experimental study, to be given in Sect. 5, demonstrates that the memory module costs less time than adding a THAN layer but achieves similar or even superior performance.

Figure 3 shows the standard computation process of THAN with the memory module on a batch of training data. THAN encodes the input data together with the latest memory and outputs the node representations. The memory module uses these representations to compute messages and update the node memory. However, the memory module does not directly affect the loss. To address this problem, we save the messages of the nodes involved in the current batch at the end of training and update the memory with the messages from the previous batch before graph embedding. The memory module consists of the following components. Memory Bank keeps the latest state vector o_i(t) for node v_i at time t, initialized as a zero vector; the memory is updated on the occurrence of each event involving the node.
Message Function is a learnable function that computes a message m_i(t) for node v_i as follows:

m_i(t) = msg(h_i(t) ‖ Φ_{d_T}(t − t_i^−)),

where h_i(t) is the representation from the graph attention module, t_i^− is the time of the previous event involving node v_i, and msg(⋅) is the message function, for which we use an FFN in this paper.

Fig. 3 Computation process of THAN with the memory module on a batch of events
In the case of an event e_{ij}(t) between source node v_i and target node v_j, two messages (i.e., m_i(t) and m_j(t)) are computed. Different from TGN [24], which considers the source node memory, target node memory, time embedding, and edge features, we concatenate the node representation with the time embedding of the elapsed time span. Message Aggregator aggregates the messages generated by the message function. Within a batch, node v_i may be involved in multiple events, each corresponding to a message. These messages m_i(t_1), ..., m_i(t_b) are aggregated as:

m̄_i(t) = agg(m_i(t_1), ..., m_i(t_b)),

where t_1, ..., t_b ≤ t and agg(⋅) can optionally be chosen as an RNN or an attention network. In this paper, we simply keep the latest message for a given node, a choice made from an efficiency perspective since it requires no learning.
Memory Updater updates the memory of the nodes with the aggregated messages:

o_i(t) = update(m̄_i(t), o_i(t^−)),

where update(⋅) is a learnable memory update function such as an LSTM [38] or GRU [39]. When an interaction event happens, the memories of both nodes it involves are updated.
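The three components can be sketched end-to-end as follows; the learnable parts are replaced by simple stand-ins (an FFN message function and a GRU updater in the paper), so this only illustrates the data flow, not the actual model:

```python
class MemoryModule:
    """Data-flow sketch of the memory pipeline: per-event messages,
    a keep-the-latest aggregator, and a memory update. Stand-ins:
    message() appends the elapsed time instead of applying
    msg(h || Phi(dt)); update() averages instead of running a GRU."""

    def __init__(self, dim):
        self.dim = dim
        self.memory = {}       # Memory Bank: node -> latest state o_i(t)
        self.last_update = {}  # node -> time of previous event t_i^-

    def message(self, node, h, t):
        # stand-in for m_i(t) = msg(h_i(t) || Phi(t - t_i^-))
        dt = t - self.last_update.get(node, 0.0)
        return h + [dt]

    def update(self, node, messages):
        latest = messages[-1]  # aggregator: keep only the latest message
        old = self.memory.get(node, [0.0] * self.dim)
        # stand-in updater: blend old memory with the new message
        self.memory[node] = [0.5 * (o + m) for o, m in zip(old, latest[:self.dim])]

    def on_event(self, u, v, h_u, h_v, t):
        # an interaction event updates the memories of both endpoints
        for node, h in ((u, h_u), (v, h_v)):
            self.update(node, [self.message(node, h, t)])
            self.last_update[node] = t
```

Because only the latest message per node survives aggregation, a batch with many events touching the same node costs one memory update for that node.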

Heterogeneous Graph Decoder
The heterogeneous graph decoder aims to reconstruct the heterogeneous edges of the graph from the node representations; in other words, it scores edge triples through a function H ∶ ℝ^d × ℝ^{d_r} × ℝ^d → ℝ, where d_r denotes the dimension of the edge type embeddings. We compute node representations through an l-layer THAN encoder and use a feed-forward neural network as the scoring function; thus an event (u, v, t) of type r is scored as:

H(u, v, t, r) = FFN(h_u(t) ‖ r ‖ h_v(t)),

where u and v denote the source and target node, respectively, and r ∈ ℝ^{d_r} denotes the edge type embedding.
As in previous work [2,23], we train the model with negative sampling. For each observed example, we corrupt the target node to construct a nonexistent event as a negative sample, so the numbers of positive and negative samples are equal. We optimize the cross-entropy loss:

L = −Σ_{(u,v,t,r) ∈ Ω} [y log σ(H(u, v, t, r)) + (1 − y) log(1 − σ(H(u, v, t, r)))] + λ ‖Θ‖_2^2,

where Ω denotes the total set of positive and negative triples, σ denotes the logistic sigmoid function, y denotes the sample label (taking the value 1 for positive samples and 0 for negative samples), Θ denotes the model parameters, and λ controls the L2 regularization.
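The objective reduces to binary cross-entropy over the scored triples plus L2 regularization; a minimal sketch, where `scores` are the H(u, v, t, r) logits:

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(scores, labels, params=None, lam=0.0):
    """Cross-entropy over positive (label 1) and corrupted negative
    (label 0) events, with optional L2 regularization on `params`."""
    loss = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    if params:
        loss += lam * sum(p * p for p in params)
    return loss
```

A zero logit yields σ = 0.5 and a per-sample loss of log 2 regardless of the label, which is a handy sanity check for an untrained scorer.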

Experiments
In this section, we present the details of experiments including experimental settings and results. Firstly, we introduce the dataset, baselines, and parameter settings. Secondly, the performance comparisons are demonstrated in detail. Thirdly, we compare the effectiveness of different variants. Finally, we test the inductive capability of our proposed model.

Datasets
We evaluate our model on three public datasets: Movielens, Twitter, and MathOverflow. The statistics of these datasets are listed in Table 2.
• Movielens 1 is a dataset of user ratings of movies at different times, collected from the MovieLens website. We select two types of nodes: user and movie. Treating different ratings of movies as different types of events, a total of five event types is obtained.
• Twitter 2 collects public data on three types of relationships (retweet, reply, and mention) between users of the US social network Twitter.
• MathOverflow 3 is from MathOverflow, a question-and-answer site for professional mathematicians. There are three relationships between users in this dataset: a user answered another user's question, a user commented on another user's question, and a user commented on an answer.

Baselines
To demonstrate effectiveness, we compare THAN with ten popular graph representation learning methods, which fall into three groups: static graph embedding (DeepWalk [27], metapath2vec [9], GraphSAGE [25], GAT [5], RGCN [2], and HGT [11]), discrete-time dynamic graph embedding (DySAT [17] and DyHATR [18]), and continuous-time dynamic graph embedding (TGAT [23] and HPGE [20]). We use the implementations of the static graph embedding methods provided in the PyTorch Geometric (PyG) package [40]; for the other baselines, we use the code released by the authors on GitHub. Besides, we ignore heterogeneity for the homogeneous methods and ignore temporal information for the static methods. For fairness, the same decoder described in Sect. 4.5 is used for the downstream temporal link prediction task.
• DeepWalk and metapath2vec: random walk-based network embedding methods designed for static graphs.
• GraphSAGE and GAT: two inductive GNN methods for static homogeneous graphs, which aggregate and update node representations in the message-passing framework.
• RGCN and HGT: two GNN methods for static heterogeneous graphs; the former maintains a unique linear projection weight for each edge type, while the latter uses mutual attention based on meta-relations to perform message passing on heterogeneous graphs.
• DySAT: a discrete-time temporal graph embedding method; we split graph snapshots following the guidance in the paper.
• DyHATR: a discrete-time THG embedding method that uses hierarchical attention to learn heterogeneous information and incorporates RNNs with temporal attention to capture evolutionary patterns.
• TGAT: a continuous-time temporal graph embedding method that aggregates historical neighbors by self-attention to obtain node representations.
• HPGE: a continuous-time THG embedding method that integrates the Hawkes process into graph embedding to capture the excitation of historical heterogeneous events on current events.

Parameter Settings
THAN was implemented in PyTorch. We split the training and test sets as 8:2 according to time order. For a fair comparison, we use the default parameter settings of the baselines and set the embedding dimension d (i.e., for node output embeddings, time embeddings, and event type embeddings) as 32, the regularization weight as 0.01, and the dropout rate as 0.1.
We employ Adam as the optimizer with a learning rate of 0.001. We randomly initialize the node vectors if the dataset does not provide node features, and similarly initialize the event features as zero vectors. For DeepWalk, metapath2vec, GraphSAGE, GAT, RGCN, and HGT, we set the maximum number of training epochs as 500 and use an early stopping strategy with patience 50. For DySAT and DyHATR, we split the datasets into 10 snapshots. For our THAN, we set the event embedding dimension as 16, the number of layers as 2, attention heads as 4, epochs as 20 (30 for Movielens), learning rate as 0.001 (0.0001 for Twitter), batch size as 800 (500 for Movielens), and the number of samples for each type of neighbors as 10 (8 for Movielens). The implementation of THAN is publicly available. 4

Effectiveness Analysis
We conduct the temporal link prediction task to verify effectiveness and efficiency; the task asks whether an edge of type r exists between two nodes at time t. We run all methods five times on the three datasets and report the average AUC (area under the receiver operating characteristic curve) and AP (average precision) scores. The overall results are shown in Table 3.
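For reference, the AUC reported here equals the probability that a randomly chosen positive event is scored above a randomly chosen negative one (ties counting one half); a minimal implementation of that rank statistic:

```python
def auc_score(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a
    random negative; O(P*N) pairwise version for clarity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Production evaluation code would use a sort-based O(n log n) formulation (e.g., a library routine), but the pairwise definition is the easiest to check by hand.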
Obviously, THAN achieves state-of-the-art performance in the AUC metric on all three datasets. Although THAN does not outperform all other methods in the AP metric, its performance remains strong (i.e., the AP score achieves the SOTA result on the Movielens and Twitter datasets and exceeds 0.9 on MathOverflow). For the Movielens dataset, dynamic graph embedding methods outperform the static graph embedding methods, which ignore temporal information, in both AUC and AP, because the former learn temporal information in fine-grained contexts. Specifically, DySAT and DyHATR obtain performance improvements by considering changes of the graph structure over time. TGAT, HPGE, and our THAN perform better than DySAT and DyHATR; this shows that making full use of temporal information is more important than simply preserving evolving structures between snapshots. For the other two datasets, the results of all methods show a similar trend, suggesting that they share similar network patterns and temporal motifs, which may be related to the fact that both are user-activity datasets.
The GNN-based approaches achieve better performance than the random walk-based approaches, since they capture more useful information about the graph structure and exploit node features. The heterogeneous graph methods perform better than the homogeneous graph methods, indicating that integrating semantic information benefits graph representation. In terms of AUC, THAN improves over TGAT by 3.38%, 2.29%, and 8.1% on the three datasets, respectively, which demonstrates the effectiveness of our proposed model. In summary, our approach works for two main reasons: (1) it effectively extracts structural features and fine-grained temporal information; (2) it reasonably handles heterogeneity in the process of message passing and aggregation.

Ablation Study
To demonstrate the effectiveness of each component in THAN, we conduct ablation experiments by removing or replacing one specific component at a time. The variants are: (1) THAN w/o time: removing time embeddings; (2) THAN without the event-type attention weight; (3) THAN w/o Qlin: removing the linear projection Q-Linear; (4) THAN r-uniform: using a uniform random sampling strategy instead of top-N recent sampling.
We report the results of the ablation study in Fig. 4, from which we make the following observations: (1) the full THAN outperforms all variants with components removed in all metrics; (2) time embeddings play an important role in temporal graph representation learning; (3) setting different attention weights for different event types helps to learn heterogeneous semantic information; (4) more recent neighbors are more useful for extracting temporal dynamics and better reflect the current state of the source node; (5) it makes sense to keep the same feature space when fusing features from different nodes. It is also noteworthy that removing the Q-Linear component has no significant impact on performance on the Twitter and MathOverflow datasets: both datasets have only one type of node, so there is no need to align the feature distributions of different node types.

Parameter Sensitivity
To investigate the robustness of THAN and find the most suitable hyperparameters, we analyze the effect of the number of neighbor samples and the number of attention heads on the three datasets, as shown in Fig. 5. For fairness, we select the number of neighbor samples from {4, 6, 8, 10} and the number of attention heads from {1, 2, 4, 6}; the remaining parameters are the same as the experimental settings in Sect. 5.1.
On the one hand, Fig. 5a, b lead to the following conclusion: the AUC and AP scores improve as the number of neighbor samples increases, except on the Movielens dataset, where they instead show a decreasing trend, likely caused by the dense connections between nodes; sampling more neighbors may introduce more noise, resulting in over-smoothed node representations. On the other hand, Fig. 5c, d show that the number of attention heads affects model performance. Multi-head attention helps to obtain representations of different aspects from different subspaces, thus enhancing expressiveness.
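The sensitivity sweep above can be sketched as a simple grid search over the two hyperparameters. Here `evaluate` is a hypothetical stand-in for a full train/validate cycle that returns a validation AUC:

```python
from itertools import product

# Grid search over (neighbor samples, attention heads); all other
# hyperparameters are held fixed, as in the experimental settings.

def grid_search(evaluate, num_neighbors=(4, 6, 8, 10), num_heads=(1, 2, 4, 6)):
    """Return the best (neighbors, heads) pair and its score."""
    best_cfg, best_score = None, float("-inf")
    for n, h in product(num_neighbors, num_heads):
        score = evaluate(n, h)
        if score > best_score:
            best_cfg, best_score = (n, h), score
    return best_cfg, best_score
```

In practice each call to `evaluate` retrains the model from scratch, so the 16-point grid is 16 full training runs per dataset.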

Effectiveness of Memory Module
In this section, we study different instances of THAN, focusing on the trade-off between accuracy and efficiency. The variants are as follows: (1) THAN-l1: using only one graph attention layer as the graph encoder; (2) THAN†-l1: a one-layer encoder with the memory module; (3) THAN-l2: stacking two graph attention layers. From Table 4, we can see that stacking two layers helps to obtain good performance (THAN-l2 vs. THAN-l1), but the time cost increases by a factor of 8 to 21. Compared to adding a THAN layer, using the memory module achieves similar or even superior performance while spending much less time (THAN-l2 takes about 13, 4, and 19 times longer than THAN†-l1 on the three datasets, respectively). On the one hand, the number of neighbors that need to be aggregated grows exponentially with each added layer. On the other hand, by accessing the memory of the source node and its 1-hop neighbors, THAN introduces long-term dependencies and indirectly accesses information from distant hops.
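The memory idea can be illustrated with a minimal sketch: each node keeps a state vector updated from incoming event messages, so a one-layer encoder that reads a neighbor's memory indirectly sees information from farther hops without the exponential neighbor expansion of a second layer. The decayed-average update below is an assumption for illustration only, not the paper's exact update rule:

```python
# Illustrative node-memory store: per-node state vectors blended with
# new event messages (hypothetical update rule, for illustration).

class NodeMemory:
    def __init__(self, dim, decay=0.9):
        self.dim, self.decay = dim, decay
        self.state = {}  # node id -> memory vector

    def read(self, node):
        """Return the node's memory, or a zero vector if unseen."""
        return self.state.get(node, [0.0] * self.dim)

    def update(self, node, message):
        """Blend the previous state with a new event message."""
        old = self.read(node)
        self.state[node] = [self.decay * o + (1 - self.decay) * m
                            for o, m in zip(old, message)]
```

Reading memory is O(1) per node, which is why evolving stored states is much cheaper than aggregating the exponentially growing 2-hop neighborhood.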

Inductive Capability Analysis
We further evaluate the inductive performance of THAN under the same settings as TGAT, i.e., masking 10% of the nodes from the training set and predicting the existence of future events involving these masked nodes. We choose GraphSAGE, GAT, and TGAT as the comparison models, since they were proposed as inductive graph representation learning methods and their inductive capabilities have been demonstrated experimentally. Experiments were conducted on the three datasets, and the results are shown in Table 5. THAN outperforms the baselines in both metrics on all datasets, which demonstrates its inductive capability.
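The inductive split above can be sketched as follows; the function name and event-tuple format are illustrative assumptions:

```python
import random

# Mask 10% of the training nodes and keep only those test events that
# involve at least one masked (i.e., unseen during training) node.

def inductive_split(train_nodes, events, mask_ratio=0.1, seed=0):
    """Split nodes into visible/masked and filter events to unseen-node ones."""
    rng = random.Random(seed)
    masked = set(rng.sample(sorted(train_nodes),
                            int(len(train_nodes) * mask_ratio)))
    visible = train_nodes - masked
    unseen_events = [(u, v, t) for u, v, t in events
                     if u in masked or v in masked]
    return visible, masked, unseen_events
```

The model is trained only on events among `visible` nodes and then asked to predict `unseen_events`, so good scores require generalizing to nodes never observed during training.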

Conclusion
Existing graph representation learning methods cannot fully capture the information in temporal heterogeneous graphs. This paper proposes THAN, a continuous-time temporal heterogeneous graph representation learning method. THAN uses dynamic transfer matrices to map nodes of different types into the same feature space and aggregates neighborhood information with a type-aware self-attention mechanism. To efficiently exploit temporal information, THAN uses a functional time encoder to generate time embeddings that are naturally integrated into the neighbor aggregation process. THAN is an inductive message-passing model based on historical neighbor sampling that not only captures temporal dynamics but also efficiently extracts topological features. In addition, we devise an optional memory module to store node states and capture long-term dependencies; it improves model performance while taking less time than stacking an additional THAN layer. The experimental results on three public datasets demonstrate that THAN outperforms the baselines on the temporal link prediction task. In future work, on the one hand, we plan to explore the usage of THAN in various fields, such as recommender systems, social networks, and biological interaction networks. On the other hand, we will try to understand the specific patterns/motifs of temporal heterogeneous networks from different domains. Furthermore, large-scale temporal heterogeneous graph embedding is another direction worthy of further investigation.

Availability of data and materials The datasets used in the experiments can be downloaded from the following URLs: Movielens: https://grouplens.org/datasets/movielens/100k Twitter: http://snap.stanford.edu/data/higgs-twitter.html MathOverflow: http://snap.stanford.edu/data/sx-mathoverflow.html.

Conflicts of interest
The manuscript entitled "Memory-Enhanced Transformer for Representation Learning on Temporal Heterogeneous Graphs" is submitted for publication in Data Science and Engineering. No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication. On behalf of my co-authors, I declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.