Introduction

The proliferation of the internet and the widespread adoption of mobile devices have contributed to the exponential growth in the usage of social media networks, such as X (Twitter), Facebook, YouTube, and Reddit. These platforms collectively boast hundreds of millions of users worldwide. Owing to this vast user base and the copious amounts of data continuously posted and shared, social networks have emerged as invaluable data sources for diverse areas of research, including social recommendation [1], community detection [2], anomaly detection [3], and more. Analyzing and predicting trends within social media networks is an increasingly attractive topic [4]. Trend prediction is the process of using historical trend data and current status data to estimate future behavioral patterns. This research provides valuable insights into the future trajectories of public concerns and interests, grounded in the wealth of available social data. From a platform perspective, precise trend prediction holds great significance in enhancing user experiences, elevating service quality, and benefiting various applications.

Fig. 1

Example of a social media network consisting of users and items. Solid lines denote the interactions between users and items, and dashed lines denote the interactions between user-item pairs. Interactions and entity attributes evolve over time. The popularity trend of an item is represented by its visit frequency, which indicates the number of accesses from users (the degree of dashed lines)

Trend prediction in social media networks poses a formidable challenge because of their heterogeneity and temporal dynamics. As illustrated in Fig. 1, many social media networks exhibit heterogeneity, featuring various types of entities and multiple types of relations. These relations encompass connections within entities of the same type and connections between entities of different types. Suppose the trend prediction task is to predict the future visit frequency of items; if the representations of item entities are acquired only from entity attributes, the impact of information diffusion among entities, such as interactions and message exchanges, is ignored. Messages of social media information diffuse through the relations from one entity to another, exerting an impact on their future evolution. Thus, it is necessary to consider the impact of information diffusion among entities. Moreover, it is imperative to capture the temporal dynamics of entities at various timestamps. Social media networks are in a constant state of flux, which implies that entity attributes and relations undergo continuous transformations. These temporal dynamics are valuable for interpreting the change pattern of the prediction target.

Based on the analysis above, this work needs to answer the following research questions to address the challenges mentioned above: RQ1: How to learn the relationships within entities in social media networks? RQ2: How to incorporate the temporal dynamics of entities and relationships into the analysis? RQ3: How to learn the dynamic representations of entities in social media networks? RQ4: How to quantify the degree of future popularity trends and predict it? RQ5: How to evaluate the efficiency of the proposed method?

Several prior works have delved into the task of popularity trend prediction in social networks [5,6,7,8]. However, these studies predominantly focus on selecting characteristics and historical statistics within social media networks, often overlooking the substantial impact of information diffusion among entities. Furthermore, most of these prior works employ simplistic machine learning or statistical models (e.g., Support Vector Machine (SVM), Logistic Regression (LR), and more) as encoders and decoders, resulting in suboptimal learning efficiency. Consequently, there exists a pressing need for an advanced framework capable of considering the influence of information diffusion in social media networks while efficiently mastering the representations of target entities.

Graph embedding methods and Graph Neural Networks (GNN) are widely regarded as ideal approaches for modeling the influence of information diffusion in social media networks. Graph embedding methods transform complex graph attributes into low-dimensional embedding vectors while maximally preserving the essential graph structural information. These methods have consistently demonstrated superior performance compared to traditional techniques for modeling graph-structured data [9, 10]. GNNs, on the other hand, belong to the category of deep learning structures designed to perform optimized transformations on all graph attributes while preserving graph topology [11]. GNNs acquire node representations by iteratively aggregating information from neighbor nodes, mirroring the information flow dynamics seen in social media networks. Prior research has also highlighted the adaptability of GNNs in handling temporal graphs [12,13,14,15] and heterogeneous graphs [16,17,18,19], laying the foundation for learning in the context of social media networks.

In light of the aforementioned considerations, this paper introduces a novel approach employing a multi-layer temporal GNN to address the task of popularity trend prediction in social media networks. The method input is a timed sequence of graph snapshots demonstrating the dynamics of social media networks, and the output is an estimated popularity attribute (e.g., visit frequency, expected revenue, and so on) in the future, indicating the popularity trends of the target entities. The novelty of the proposed method is its unified structure that integrates a multi-layer GNN encoder and a time-sequence unit to effectively learn the temporal node representations by modeling the latent influence of heterogeneous relations and temporal dynamics within social media networks (Answer to RQ1, RQ2, and RQ3). Besides, a Graph Structure Learning (GSL) module is implemented to enhance the quality of input graphs, which is particularly essential when dealing with real-world graphs. Specifically, the evolution of social media networks is characterized as a discrete-time temporal graph comprising a timed sequence of graph snapshots separated by specific time intervals. Within each graph snapshot, node attributes and the intricate multi-layer graph structures are scrutinized to forecast how a specific target node status will evolve in subsequent graph snapshots (Answer to RQ4).

The proposed method is evaluated on four real-world popularity trend prediction tasks using benchmark data sets (Answer to RQ5). The experimental results consistently demonstrate the high efficiency of the proposed approach. The approach consistently outperforms various baseline methods, including traditional linear regression algorithms, time-sequence models, prior popularity trend prediction methods, and recently introduced heterogeneous GNN methods.

In this paper, we make several significant contributions to the field of trend prediction tasks in social media networks:

  • We introduce an advanced multi-layer temporal GNN framework designed to learn entity representations and predict trends within complex real-world social networks. It addresses the crucial task of forecasting trajectories of public concerns and interests based on extensive real-world data.

  • We address the limitations of previous research regarding popularity trend predictions, which often neglect the latent influence of complex relationships among entities. The proposed approach presents its novelty in considering the latent influence of information diffusion in social networks while efficiently mastering the representations of target entities, resulting in a substantial performance enhancement.

  • We demonstrate the effectiveness of the proposed approach through experiments conducted on real-world social network data sets, confirming its feasibility in real-world situations. The proposed method consistently outperforms various baseline methods, including linear regression approaches, time-series models, previous popularity trend prediction methods, and recently proposed heterogeneous GNN methods.

The remainder of the paper is organized as follows. “Related work” Section summarizes the recent literature on graph embedding learning and trend prediction. “Problem setup” Section introduces the problem statement of the popularity trend prediction task in social media networks. “Methodology” Section presents the proposed multi-layer temporal GNN framework and its key components. “Experiments” Section reports the experimental results, evaluation, and additional discussions regarding parameter sensitivity, the ablation study, and statistical tests. “Conclusions and discussion” Section concludes this research.

Related work

Graph embedding learning

Graph embedding learning has achieved notable success in various domains [9, 10]. The techniques for generating embedding vectors on graphs have gained widespread recognition, particularly for downstream graph-related tasks such as node classification [20], link prediction [21], and graph classification [22]. The primary challenge in graph learning lies in discovering effective methods to encode graph structures, encompassing nodes and edges, into low-dimensional hidden embedding vectors while maximizing the preservation of essential graph structural information.

GNNs have emerged as a robust deep learning framework for graph embedding learning [11]. Among these, the graph convolutional network (GCN) [20] stands out as one of the most widely adopted models. GCN efficiently aggregates neighbor node information through graph convolutional layers. To further enhance GNN performance, numerous researchers have introduced advanced GNN models building upon the GCN architecture. For example, GraphSAGE extends GCN into inductive learning that can handle unknown graph nodes [23]. Graph attention networks employ attention mechanisms that assign importance weights to different neighbor nodes [24]. Graph autoencoder and variational graph autoencoder are GCN-based encoder–decoder models that can handle unsupervised learning tasks [25]. Moreover, ML-GCN and ML-GAT extend the original GCN and GAT to multi-layer networks, which can capture more complex relations among nodes within large-scale graphs [26].

Learning on temporal graphs presents considerably greater complexity compared to static graphs. Numerous studies have focused on discrete-time temporal graphs, which consist of a timed sequence of graph snapshots [12,13,14, 27]. Existing static graph methods can be directly applied to each individual graph snapshot. In parallel, substantial research efforts have been dedicated to graph embedding learning for heterogeneous graphs [16, 19]. Heterogeneous graphs encompass nodes and edges of multiple types. HetGNN combines heterogeneous structural information and node attributes. It performs excellently in graphs with multiple types of nodes and edges [18]. HAN is a heterogeneous GNN based on the hierarchical attention mechanism, including node-level and semantic-level attention [17]. It fully considers the importance of node neighbors and different meta-paths. This research endeavors to develop a unified framework that integrates the characteristics of discrete-time temporal graphs and multi-layer heterogeneous graphs.

Trends prediction tasks in real-world data

Time-aware trend prediction tasks have recently attracted much attention in both academia and industry [4,5,6]. Existing work on trend prediction can be categorized into two primary patterns. The first is to predict the growth and decline of entities based on past characteristics and early-stage patterns. Yang et al. found that temporal patterns reveal how content popularity fluctuates during post propagation [28]. The second is to predict the value of specific target attributes based on temporal attributes or dynamic signals. Zhao et al. incorporated human reaction time as temporal variables in self-exciting point processes [29]. In addition, hybrid methods between metaheuristics and machine learning have led to a novel research field, which successfully combines machine learning and swarm intelligence approaches and has proved capable of obtaining outstanding results in different trend prediction areas [30,31,32].

Employing graph-based methods to tackle trend prediction tasks is a relatively recent research direction. He et al. designed a time-aware bipartite graph for estimating the future popularity of items [33]. Cao et al. developed a coupled GNN model to solve the popularity trend prediction task [7]. Li et al. adopted a graph kernel approach to predict the node popularity within a cascade graph sequence [8]. Hou et al. proposed a spatial–temporal multi-graph convolutional network for casualty prediction of terrorist attacks [34]. Yang et al. predicted traffic propagation flow in urban road networks with a multi-graph convolutional network model [35]. However, these methods did not account for multiple relations among various types of entities, limiting their applicability in complex real-world scenarios.

Fig. 2

Problem setup of the popularity trend prediction in social networks. The social networks are presented as a sequence of graph snapshots \(\textbf{G}=\{\textbf{G}^{(1)}-\textbf{G}^{(t)}\} \). The proposed method first learns the node embeddings \(\textbf{Z}_\textbf{m}^{(1)}-\textbf{Z}_\textbf{m}^{(t)}\) of the target node type m by an encoder \(\mathcal {G}\), then predicts the node labels \(\mathbf {l_m}^{(2)}-\textbf{l}_\textbf{m}^{(t+1)}\) in the upcoming graph snapshots by a decoder \(\mathcal {F}\)

Problem setup

The temporal social network data are structured as a sequence of graph snapshots, each separated by a specific time interval. Many social networks exhibit heterogeneity, featuring various types of entities and multiple types of relations. These relations encompass connections within entities of the same type and connections between entities of different types. To represent such complicated networks, a sequence of graph snapshots \(\textbf{G}=\{\textbf{G}^{(t)}\mid t\in \{1, 2, \cdots , T\}\}\) is proposed, where each graph snapshot \(\textbf{G}^{(t)}=\{\textbf{V}, \textbf{E}, \textbf{F}\}\) contains node sets of different types \(\textbf{V}=\{\textbf{V}_\textbf{n}\mid n\in \{1, 2, \cdots , N\}\}\), edge sets \(\textbf{E}=\{\textbf{E}_{\textbf{jk}} \mid j,k\in \{1, 2, \cdots , N\}\}\) between nodes in \(\textbf{V}_\textbf{j}\) and \(\textbf{V}_\textbf{k}\), and node attribute matrices \(\textbf{F}=\{\textbf{F}_\textbf{n}\mid n\in \{1, 2, \cdots , N\}\}\) of each node set. Specifically, for edge set \(\textbf{E}_{\textbf{jk}}\), if \(j\ne k\), \(\textbf{E}_{\textbf{jk}}\) represents the bipartite relations of node pairs between \(\textbf{V}_\textbf{j}\) and \(\textbf{V}_\textbf{k}\). If \(j=k\), \(\textbf{E}_{\textbf{jk}}\) represents the intra-relations within node set \(\textbf{V}_\textbf{j}\).

The research problem is set up as demonstrated in Fig. 2. The task is to predict the popularity trend on entities of a certain target type m, quantified by a label \(\textbf{L}_m\). The proposed method first learns the embeddings \(\textbf{Z}_{m}^{(t)}\) of nodes in \(\textbf{V}_m\) of each graph snapshot \(\textbf{G}^{(\textbf{t})}\) by an encoder \(\mathcal {G}\). Specifically, node embedding \(\textbf{Z}_{m}^{(t)}\) is obtained based on the node attribute matrices \(\textbf{F}\) and edges \(\{\textbf{E}_{\textbf{mj}} \mid j\in \{1, 2, \cdots , N\}\}\) connecting the target nodes, as formulated in (1):

$$\begin{aligned} \mathcal {G}: (\textbf{V}, \{\textbf{E}_{\textbf{mj}} \mid j\in \{1, 2, \cdots , N\}\}, \textbf{F}) \rightarrow \textbf{Z}_{m}^{(t)}. \end{aligned}$$
(1)

Labels \(\textbf{l}_m^{(t)}\in \textbf{L}_m\) are assigned to nodes in \(\textbf{V}_m\) of each graph snapshot \(\textbf{G}^{\mathbf {(t)}}\). The label indicates the target value to be predicted. We aim to use the learned node embeddings \(\textbf{Z}_{m}^{(t)}\) and the known node labels \(\textbf{l}_m^{(t)}\) to predict the node labels \({\hat{\textbf{l}}}_m^{(t+1)}\) in the upcoming graph snapshot \(\textbf{G}^{(\mathbf {t+1})}\) by a decoder \(\mathcal {F}\), as formulated in (2):

$$\begin{aligned} \mathcal {F}: (\textbf{Z}_{m}^{(t)}, \textbf{l}_m^{(t)}) \rightarrow {\hat{\textbf{l}}}_{m}^{(t+1)}. \end{aligned}$$
(2)
Fig. 3

Overview of the proposed multi-layer temporal GNN framework. Inputs are depicted as white blocks, main components are highlighted in blue, intermediate results are presented in yellow, and final outputs are shown in red. Intra graphs comprise the multiple relations within the target entities of the same type. The bipartite graph indicates the relationship between entities of different types

Methodology

Previous studies regarding popularity trend prediction tasks in social networks suffer from neglecting the latent influence of complex relationships among entities. To overcome the limitations in prior work, this work introduces an advanced multi-layer temporal GNN framework designed to effectively capture the intricate relations and learn the representation embeddings of target entities in social networks.

In the following, we first demonstrate the overall framework of the proposed method. Then, we explain the details of the key components. Finally, we analyze the computational complexity of the proposed methods.

Overview

Figure 3 illustrates an overview of the proposed method. The input is a timed sequence of graph snapshots \(\textbf{G}=\{\textbf{G}^{(\textbf{t})}\mid t\in \{1, 2, \cdots , T\}\}\). Each graph snapshot contains adjacency matrices \(\textbf{A}\), expressed as a tensor that stacks multiple adjacency matrices, along with a node attribute matrix \(\textbf{F}\). The input adjacency matrices encompass ‘intra graphs’ \(\textbf{A}_{\textbf{in}}\) and ‘bipartite graphs’ \(\textbf{A}_{\textbf{bi}}\). Intra graphs are homogeneous, signifying the multiple relations within the target entities of the same type, while bipartite graphs represent the relationships between entities of different types. Node attributes \(\textbf{F}\) are shared across all the graphs, with only edge weights varying between individual graphs.

This framework has four key components: the Graph Structure Learning (GSL) module, the node aggregation module, the time-sequence unit, and the multi-head attention layer. First, the input graphs \(\textbf{A}\) are enhanced by the GSL module, which improves the quality of the input adjacency matrices through a refined graph structure \(\hat{\textbf{A}}\) learned from the node embeddings \(\textbf{Z}^{(t-1)}\) of the last iteration. Next, the enhanced adjacency matrices \(\textbf{A}^{*}_{\textbf{in}}\) and \(\textbf{A}^{*}_{\textbf{bi}}\) are fed to the node aggregation module, which learns the hidden node embeddings \(\textbf{H}_{in}^{(t)}\) and \(\textbf{H}_{bi}^{(t)}\) by aggregating information from neighbor nodes. Then, \(\textbf{H}_{in}^{(t)}\) and \(\textbf{H}_{bi}^{(t)}\) are sent to the time-sequence unit to incorporate the temporal information \(\textbf{H}_{in}^{*(t-1)}\) from the preceding graph snapshot. Finally, a multi-head attention layer receives the merged hidden node embeddings \(\textbf{H}_{in}^{*(t)}\) and the node attributes \(\textbf{F}\) to learn the node embeddings. The learned node embeddings \(\textbf{Z}^{(t)}\) are responsible for predicting the node labels \(\hat{\textbf{l}}^{(t+1)}\), representing the target attribute for popularity trend prediction.
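
To make the data flow concrete, the following is a minimal PyTorch-style sketch of how these components could be chained for a single graph snapshot. The `modules` dictionary and all names inside it (`gsl_in`, `bigcn`, `lstm_in`, `attention`, `decoder`, and so on) are hypothetical placeholders for the components described above, not the authors' released implementation.

```python
import torch

def forward_snapshot(A_in, A_bi, F_attr, Z_prev, H_prev, C_prev, modules):
    """One forward pass over a single graph snapshot (illustrative sketch only)."""
    # 1. Graph structure learning: refine both adjacency inputs with Z^(t-1)
    A_in_star = modules["gsl_in"](A_in, Z_prev)
    A_bi_star = modules["gsl_bi"](A_bi, Z_prev)

    # 2. Node aggregation on the intra and bipartite graphs
    H_in = modules["gcn"](A_in_star, F_attr)
    H_bi = modules["bigcn"](A_bi_star, F_attr)

    # 3. Time-sequence unit: one LSTM cell per graph layer
    H_in_star, C_in = modules["lstm_in"](H_in, (H_prev["in"], C_prev["in"]))
    H_bi_star, C_bi = modules["lstm_bi"](H_bi, (H_prev["bi"], C_prev["bi"]))

    # 4. Concatenate the layers, attend with node attributes as queries, add a skip-connection
    H_star = torch.cat([H_in_star, H_bi_star], dim=-1)
    Z_t = Z_prev + modules["attention"](F_attr, H_star, H_star)

    # 5. Decode the popularity label for the next snapshot
    l_hat = modules["decoder"](Z_t)
    return Z_t, l_hat, {"in": H_in_star, "bi": H_bi_star}, {"in": C_in, "bi": C_bi}
```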

The details of key components are introduced in the following:

Graph structure learning

Noisy or incomplete graphs can often lead to suboptimal representations, hindering the efficient learning of node embeddings, especially when dealing with real-world graph data. GSL is designed to jointly learn an optimized graph structure and the corresponding adjacency matrix [36]. Based on the assumption that edges tend to connect nodes with similar representations, the proposed method refines the input graphs by computing the similarities between the most recently updated node embedding pairs, which are then used as the edge weights.

In the embedding space, plain Euclidean distance is not an ideal metric for graph node similarity. Instead, a generalized adaptive Mahalanobis distance is employed to quantify node similarity, as formulated in (3):

$$\begin{aligned} \phi (\textbf{z}_i, \textbf{z}_j) = \sqrt{\left( \textbf{z}_i-\textbf{z}_j\right) ^T\textbf{W}_d\textbf{W}^T_d\left( \textbf{z}_i-\textbf{z}_j\right) }, \end{aligned}$$
(3)

where \(\textbf{z}_i, \textbf{z}_j\) denote the embeddings of nodes i and j learned in the last graph snapshot, and \(\textbf{W}_d\) is a trainable weight matrix. \(\textbf{W}_d\textbf{W}^T_d\) forms a symmetric positive semi-definite trainable matrix, so the Mahalanobis distance adapts to the task and the node embeddings during training. This distance is then used to calculate the refined adjacency matrix through a Gaussian kernel:

$$\begin{aligned} \hat{\textbf{A}}_{\textbf{ij}} = \exp {\left( \frac{-\phi (\textbf{z}_i, \textbf{z}_j)}{2\sigma ^2}\right) }, \end{aligned}$$
(4)

where \(\hat{\textbf{A}}_{\textbf{ij}}\) denotes an element of the refined adjacency matrix \(\hat{\textbf{A}}\), and \(\sigma \) denotes the bandwidth of the Gaussian kernel.

After obtaining the refined graph adjacency matrix \(\hat{\textbf{A}}\), a residual connection [37] is used to incorporate the original input adjacency matrix \(\textbf{A}\) through a hyperparameter \(\alpha \), as formulated in (5):

$$\begin{aligned} \textbf{A}^{*} = \alpha \hat{\textbf{A}} + (1-\alpha )\textbf{A}, \end{aligned}$$
(5)

where \(\alpha \) balances the influence of the two parts, and \(\textbf{A}^{*}\) is the final enhanced adjacency matrix.
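
As a concrete illustration, a minimal PyTorch sketch of the computation in Eqs. (3)–(5) could look as follows; the class name, the projection dimension of \(\textbf{W}_d\), and the default values of \(\sigma \) and \(\alpha \) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GraphStructureLearning(nn.Module):
    """Illustrative sketch of the GSL module, Eqs. (3)-(5)."""

    def __init__(self, emb_dim, proj_dim=16, sigma=1.0, alpha=0.1):
        super().__init__()
        self.W_d = nn.Parameter(torch.randn(emb_dim, proj_dim))  # W_d in Eq. (3)
        self.sigma = sigma    # Gaussian kernel bandwidth in Eq. (4)
        self.alpha = alpha    # residual mixing weight in Eq. (5)

    def forward(self, A, Z_prev):
        # Adaptive Mahalanobis distance between all node pairs, Eq. (3)
        proj = Z_prev @ self.W_d                # project embeddings with W_d
        dist = torch.cdist(proj, proj, p=2)     # pairwise distances phi(z_i, z_j)
        # Gaussian kernel turns distances into refined edge weights, Eq. (4)
        A_hat = torch.exp(-dist / (2 * self.sigma ** 2))
        # Residual combination with the original adjacency matrix, Eq. (5)
        return self.alpha * A_hat + (1 - self.alpha) * A
```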

Node aggregation

The core operation of a GNN is to recursively aggregate information from neighbor nodes. The proposed method uses the graph convolutional layer [38] to aggregate neighbor node information in the intra graphs \(\mathbf {A_{in}^*}\), as formulated in (6):

$$\begin{aligned} \textbf{H}^{(l+1)} = \sigma \left( \tilde{\textbf{D}}^{-\frac{1}{2}}\tilde{\textbf{A}}_{\textbf{in}}^{*}\tilde{\textbf{D}}^{-\frac{1}{2}}\textbf{H}^{(l)} \textbf{W}_{in}^{(l+1)}\right) , \end{aligned}$$
(6)

where \(\tilde{\textbf{A}}_{\textbf{in}}^{*} = \textbf{A}_{\textbf{in}}^{*} + \textbf{I}\), \(\tilde{\textbf{D}}\) is the diagonal degree matrix of \(\tilde{\textbf{A}}_{\textbf{in}}^{*}\), \(\textbf{H}^{(l)}\) is the node embedding matrix in the lth layer, \(\textbf{H}^{(0)}\) is initialized with the node feature matrix \(\textbf{F}\), \(\textbf{W}_{in}^{(l)}\) is the trainable weight matrix in the lth layer, and \(\sigma (\cdot )\) is the activation function. The output of the last graph convolutional layer is stored as the result of node aggregation on the intra graphs, denoted as \(\textbf{H}_{in}^{(t)}\), where t indexes the tth graph snapshot \(\textbf{G}^{(t)}\).
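
A minimal sketch of the propagation rule in Eq. (6), assuming dense tensors; in practice the adjacency matrices would typically be sparse and the layer wrapped in an nn.Module.

```python
import torch

def gcn_layer(A_star, H, W):
    """Symmetrically normalized graph convolution, Eq. (6) (illustrative sketch)."""
    A_tilde = A_star + torch.eye(A_star.size(0))      # add self-loops
    deg = A_tilde.sum(dim=1).clamp(min=1e-12)         # node degrees
    D_inv_sqrt = torch.diag(deg.pow(-0.5))            # D^{-1/2}
    return torch.relu(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)
```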

Specifically, for bipartite graph \(\textbf{A}^{*}_{\textbf{bi}}\) that models the relation between the target node set \(\textbf{V}_m\) and other types of node set \(\textbf{V}_k\), a dedicated bipartite GCN layer [39] is exploited to separately execute graph convolutional operations on \(\textbf{V}_m\) and \(\textbf{V}_k\), as formulated in (7) and (8):

$$\begin{aligned}&\textbf{H}_m^{\left( l+1 \right) } = \sigma \left( \left[ \textbf{D}_m^{-1} \textbf{B}_m \textbf{H}_k^{\left( l \right) } \textbf{W}_m^{\left( l+1 \right) } \parallel \textbf{F}_m \varvec{\omega }_m^{\left( l+1\right) }\right] \right) \end{aligned}$$
(7)
$$\begin{aligned}&\textbf{H}_k^{\left( l+1 \right) } = \sigma \left( \left[ \textbf{D}_k^{-1} \textbf{B}_k \textbf{H}_m^{\left( l \right) } \textbf{W}_k^{\left( l+1 \right) } \parallel \textbf{F}_k \varvec{\omega }_k^{\left( l+1\right) }\right] \right) , \end{aligned}$$
(8)

where \(\textbf{B}_m \in \mathcal {R}^{\Vert \textbf{V}_m\Vert \times \Vert \textbf{V}_k\Vert }\) and \(\textbf{B}_k \in \mathcal {R}^{\Vert \textbf{V}_k\Vert \times \Vert \textbf{V}_m\Vert }\) are the incidence matrices of the two node sets \(\textbf{V}_m\) and \(\textbf{V}_k\) in the bipartite graph \(\mathbf {A^*_{bi}}\), respectively, \(\textbf{D}_m = \textrm{Diag}(\sum _i\textbf{B}_{m(1,i)},\dots ,\sum _i \textbf{B}_{m(\Vert \textbf{V}_m\Vert ,i)})\) and \(\textbf{D}_k =\textrm{Diag}(\sum _i\textbf{B}_{k(1,i)},\dots ,\sum _i \textbf{B}_{k(\Vert \textbf{V}_k\Vert ,i)})\) are the diagonal degree matrices of \(\textbf{V}_m\) and \(\textbf{V}_k\), and \(\textbf{H}_m^{(l)}\) and \(\textbf{H}_k^{(l)}\) are the learned node embedding matrices in the lth layer. \(\textbf{W}_m^{(l)}\), \(\textbf{W}_k^{(l)}\), \(\varvec{\omega }_m^{(l)}\), and \(\varvec{\omega }_k^{(l)}\) are trainable parameters in the lth layer. \(\parallel \) is the concatenation operation.

The bipartite GCN layer discards the result \(\textbf{H}_k^{\left( l\right) }\) of the other node set and only outputs the result \(\textbf{H}_m^{\left( l\right) }\) on the target node set \(\textbf{V}_m\), which is denoted as \(\textbf{H}_{bi}^{(t)}\), where t indexes the tth graph snapshot \(\textbf{G}^{(t)}\).
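
One side of the bipartite convolution in Eq. (7) can be sketched as follows (Eq. (8) is symmetric); the function signature and names are illustrative assumptions.

```python
import torch

def bigcn_layer(B_m, H_k, F_m, W_m, omega_m):
    """Target-side bipartite graph convolution, Eq. (7) (illustrative sketch).

    B_m          : (|V_m|, |V_k|) incidence matrix of the bipartite graph
    H_k          : (|V_k|, d)     embeddings of the other node set
    F_m          : (|V_m|, f)     attributes of the target node set
    W_m, omega_m : trainable weight matrices
    """
    deg = B_m.sum(dim=1, keepdim=True).clamp(min=1e-12)   # row sums give D_m
    aggregated = (B_m / deg) @ H_k @ W_m                   # D_m^{-1} B_m H_k W_m
    self_part = F_m @ omega_m                              # F_m omega_m
    return torch.relu(torch.cat([aggregated, self_part], dim=-1))  # concatenation (||)
```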

Time-sequence unit

The time-sequence unit receives the most recently updated node embeddings from the previous graph snapshot and transmits these learned node embeddings to the subsequent graph snapshot. An LSTM cell [40] is implemented to achieve this function:

$$\begin{aligned} \textbf{h}_t, C_t&= {\text {LSTMCell}}(\textbf{X}_t, \textbf{h}_{t-1}, C_{t-1}) \end{aligned}$$
(9)
$$\begin{aligned} \textbf{H}^{*(t)}, C_t&= {\text {LSTMCell}}(\textbf{H}^{(t)}, \mathbf {H^*}^{(t-1)}, C_{t-1}). \end{aligned}$$
(10)

The LSTM cell has three inputs and two outputs: the input embedding \(\textbf{X}_t\), the input hidden state \(\textbf{h}_{t-1}\), the input cell state \(C_{t-1}\), the output hidden state \(\textbf{h}_t\), and the output cell state \(C_{t}\), as formulated in (9). The outputs of the node aggregation module, \(\textbf{H}_{in}^{(t)}\) and \(\textbf{H}_{bi}^{(t)}\) (denoted as \(\textbf{H}^{(t)}\) for convenience in the following), are fed into the LSTM cell as \(\textbf{X}_t\), and the updated hidden embeddings \(\textbf{H}^{*(t-1)}\) from the last graph snapshot \(\textbf{G}^{(t-1)}\) are fed into the LSTM cell as the input hidden state \(\textbf{h}_{t-1}\). The output hidden state \(\textbf{h}_t\) is stored as the updated hidden embeddings \(\textbf{H}^{*(t)}\). \(\textbf{H}^{*(t)}\) and the cell state \(C_t\) are passed to the following graph snapshot \(\textbf{G}^{(t+1)}\). A separate LSTM cell is implemented for each individual graph. The inputs and outputs are formulated in (10).
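
The unit can be realized directly with torch.nn.LSTMCell, treating every node as one element of the batch dimension; the shapes below are hypothetical.

```python
import torch
import torch.nn as nn

N, D = 100, 128                      # hypothetical number of nodes and hidden dimension
lstm_cell = nn.LSTMCell(input_size=D, hidden_size=D)

H_t = torch.randn(N, D)              # H^(t): output of the node aggregation module
H_star_prev = torch.zeros(N, D)      # H*^(t-1): updated embeddings from snapshot t-1
C_prev = torch.zeros(N, D)           # cell state carried over from snapshot t-1

# Eq. (10): every node is treated as one element of the batch
H_star_t, C_t = lstm_cell(H_t, (H_star_prev, C_prev))
# H_star_t and C_t are passed on to the next graph snapshot G^(t+1)
```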

Multi-head attention layer

After the time-sequence unit, a concatenation layer merges the updated hidden embeddings from each LSTM cell into a single hidden embedding matrix, denoted as \(\textbf{H}^{*(t)}\). Then, a multi-head attention layer is exploited to obtain the node embeddings. The multi-head attention mechanism originated from the Transformer [41] and has proved efficient in learning temporal node representations [42,43,44].

The multi-head attention layer plays an efficient role in aggregating neighbor node embeddings and node attributes. The layer operates by computing the dot product of a query vector with a set of key vectors, ultimately generating a weight vector that assigns importance scores to the corresponding value vectors. The mechanism is defined as follows:

$$\begin{aligned} \textbf{Q} =&\, \textbf{F}\textbf{W}_Q \end{aligned}$$
(11)
$$\begin{aligned} \textbf{K} =&\,\mathbf {H^*}^{(t)}\textbf{W}_K, \textbf{V} = \mathbf {H^*}^{(t)}\textbf{W}_V \end{aligned}$$
(12)
$$\begin{aligned} {\text {head}}_i =&\,\, {\text {Attn}}\left( \textbf{Q}_i, \textbf{K}_i, \textbf{V}_i\right) \end{aligned}$$
(13)
$$\begin{aligned} =&\,{\text {softmax}}\left( \frac{\textbf{Q}_i\textbf{K}_i^T}{\sqrt{d}} \right) \textbf{V}_i \end{aligned}$$
(14)
$$\begin{aligned} {\text {MultiHead}}&\left( \textbf{Q}, \textbf{K}, \textbf{V} \right) \nonumber \\ =&\,{\text {Concat}}({\text {head}}_1, {\text {head}}_2, \cdots , {\text {head}}_h)\textbf{W}. \end{aligned}$$
(15)

In the multi-head attention layer, the node hidden embedding \(\textbf{H}^{*(t)}\) is passed to the key and value parameters \(\textbf{K}\) and \(\textbf{V}\), and the node attribute matrix \(\textbf{F}\) is passed to the query parameter \(\textbf{Q}\). \(\textbf{W}_Q, \textbf{W}_K, \textbf{W}_V\) and \(\textbf{W}\) are trainable parameters. \({\text {Attn}}(\cdot )\) is a scaled dot-product attention encoder implemented by the softmax function, where d denotes the dimension of the query parameter. \({\text {MultiHead}}(\cdot )\) concatenates the attention score for each head into a single attention score matrix.

Besides, a skip-connection is exploited to add the output attention score matrix \({\text {MultiHead}}(\textbf{Q}, \textbf{K}, \textbf{V})\) and the last updated node embedding matrix \(\textbf{Z}^{(t-1)}\) together, which is passed from the last graph snapshot \(\textbf{G}^{(t-1)}\). The skip-connection is formulated in (16):

$$\begin{aligned} \textbf{Z}^{(t)}&= \textbf{Z}^{(t-1)} + {\text {MultiHead}}(\textbf{Q}, \textbf{K}, \textbf{V}) \end{aligned}$$
(16)
$$\begin{aligned}&= \textbf{Z}^{(t-1)} + {\text {MultiHead}}(\textbf{F}, \textbf{H}^{*(t)}, \textbf{H}^{*(t)}) \end{aligned}$$
(17)

The result of the skip-connection, \(\textbf{Z}^{(t)}\), is the final node embedding matrix.
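
The sketch below approximates this step with torch.nn.MultiheadAttention. Note that the built-in layer applies its own internal query/key/value projections, and the linear layer that maps \(\textbf{F}\) to the embedding dimension is an assumption, so this is an approximation of Eqs. (11)–(17) rather than an exact reproduction.

```python
import torch
import torch.nn as nn

N, F_dim, D = 100, 64, 128                        # hypothetical sizes
attn = nn.MultiheadAttention(embed_dim=D, num_heads=2, batch_first=True)
proj_q = nn.Linear(F_dim, D)                      # maps node attributes F to the query space

F_attr = torch.randn(1, N, F_dim)                 # node attribute matrix F (batch of one graph)
H_star = torch.randn(1, N, D)                     # merged hidden embeddings H*^(t)
Z_prev = torch.randn(1, N, D)                     # node embeddings Z^(t-1)

# Eqs. (11)-(15): F supplies the queries, H*^(t) supplies the keys and values
attn_out, _ = attn(proj_q(F_attr), H_star, H_star)
# Eqs. (16)-(17): skip-connection with Z^(t-1) yields the final embeddings Z^(t)
Z_t = Z_prev + attn_out
```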

Loss function

The obtained node embedding \(\textbf{Z}^{(t)}\) is used to predict the node labels \(\hat{\textbf{l}}^{(t+1)}\) through a two-layer MLP decoder. The predicted node labels \(\hat{\textbf{l}}^{(t+1)}\) should be as close as possible to the target value of popularity trend prediction \(\textbf{l}^{(t+1)}\) in the next graph snapshot \(\textbf{G}^{(t+1)}\). Thus, it is a regression problem between the predicted value \(\hat{\textbf{l}}^{(t+1)}\) and the actual value \(\textbf{l}^{(t+1)}\). The Mean Squared Error (MSE) loss is exploited to train the parameters:

$$\begin{aligned} \mathcal {L}_{{\text {reg}}}&= {\text {MSELoss}}\left( \hat{\textbf{l}}^{(t+1)}, \textbf{l}^{(t+1)}\right) \end{aligned}$$
(18)
$$\begin{aligned}&= \frac{1}{N}\sum _{i=1}^{N}\left( \hat{\textbf{l}}^{(t+1)}_i - \textbf{l}^{(t+1)}_i \right) ^2, \end{aligned}$$
(19)

where N denotes the total number of nodes in the training set.

Besides, the learnable parameters in the GSL module also need constraints to accelerate training and increase the stability of the learned graph topology. An MSE loss is exploited to constrain the gap between the refined \(\hat{\textbf{A}}\) and the original \(\textbf{A}\) for each graph:

$$\begin{aligned} \mathcal {L}_{gsl}&= \sum ^{\Vert \textbf{A}\Vert } {\text {MSELoss}}\big (\hat{\textbf{A}}, \textbf{A}\big ) \end{aligned}$$
(20)
$$\begin{aligned}&= \sum ^{\Vert \textbf{A}\Vert } \frac{1}{N}\sum _{i=1}^{N}\big (\hat{\textbf{A}}_i - \textbf{A}_i \big )^2, \end{aligned}$$
(21)

where \({\Vert \textbf{A}\Vert }\) denotes the number of adjacency matrices in the tensor \(\textbf{A}\).

Finally, the total loss is

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{{\text {reg}}} + \lambda \mathcal {L}_{{\text {gsl}} }, \end{aligned}$$
(22)

where \(\lambda \) is a hyperparameter for balancing two losses.
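
A minimal sketch of the combined objective in Eqs. (18)–(22); the function and argument names are illustrative.

```python
import torch.nn.functional as F

def total_loss(l_hat, l_true, A_hat_list, A_list, lam=0.1):
    """Combined training objective, Eqs. (18)-(22) (illustrative sketch)."""
    # Regression loss between predicted and true popularity labels, Eq. (18)
    loss_reg = F.mse_loss(l_hat, l_true)
    # Structure-learning loss summed over every refined/original adjacency pair, Eq. (20)
    loss_gsl = sum(F.mse_loss(A_hat, A) for A_hat, A in zip(A_hat_list, A_list))
    # Weighted total loss, Eq. (22)
    return loss_reg + lam * loss_gsl
```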

Computational complexity

This section conducts a computational complexity analysis of the proposed method, with a particular focus on key components, including the GSL module, node aggregation module, time-sequence unit, and the multi-head attention layer. Regarding Eqs. (3), (4), (6)–(15), the complexity is primarily dominated by matrix multiplication. For convenience, this analysis assumes that the average node number of each node set is N, and both the node attributes \(\textbf{F}\) and node embeddings (\(\textbf{H}^{(t)}\) and \(\textbf{Z}^{(t)}\)) share the same hidden dimension D. The computational complexity is calculated as follows:

The complexity of the GSL module is \(\mathcal {O}(N^2D^2)\). The complexity of the graph convolutional layer in the node aggregation module is \(\mathcal {O}(N^2D+ND^2)\). The complexity of the LSTM cell is \(\mathcal {O}(ND)\). The complexity of the multi-head attention layer is \(\mathcal {O}(N^2D+ND^2)\). The GSL module has the highest complexity, followed by the node aggregation module and the multi-head attention layer, while the time-sequence unit has the lowest complexity. As a result, the total complexity of the proposed method is \(\mathcal {O}(N^2D^2) + \mathcal {O}(N^2D+ND^2) + \mathcal {O}(ND) + \mathcal {O}(N^2D+ND^2) = \mathcal {O}(N^2D^2)\).

Experiments

In this section, the effectiveness of the proposed method is validated in real-world popularity trend prediction tasks. Experiments are conducted on four social media network data sets and evaluate the performance of the proposed method against baseline methods. Additionally, the sensitivity of the model hyperparameters is evaluated. Furthermore, an ablation study is performed to assess the significance of the key modules within the proposed method.

Data set description

The real-world social media network data sets used in the experiments are listed in Table 1.

Table 1 Statistics of the social media network data sets used in our experiments
  • YouTube live streaming data set (Footnote 1) [45]: This public data set comprises information on 1,358 live streaming channels and more than 136,000 viewers, spanning from April 2021 to July 2022. The attributes of both channels and viewers are represented as vectors. Relations within channels include the rate of common viewers, conflicts in streaming time, and the similarity of streaming content. Relations between channels and viewers represent donation interactions and the amount of income. The data set is separated into several graph snapshots, each separated by a 1-month interval. The objective of the popularity trend prediction task within this data set is to forecast the donation income for each channel in the upcoming months.

  • MOOC students and courses (Footnote 2) [46]: This public data set encompasses interactions performed by students on a MOOC (Massive Open Online Course) platform, involving 7047 users engaging with 98 courses, resulting in a total of 411,749 interactions. Both student and course attributes are represented as feature vectors. The data set is separated into several graph snapshots, each separated by a 2-day interval. The objective of the popularity trend prediction task within this data set is to predict the frequency of visits to each course in the forthcoming days.

  • Reddit post data set (Footnote 3) [46]: This public data set comprises one month of posts from 10,000 active users on the 1000 most active topics on the Reddit community forum, resulting in 672,447 interactions. The text of each post is converted into feature vectors. The data set is separated into several graph snapshots, each separated by a 2-day interval. The objective of the popularity trend prediction task within this data set is to predict the post numbers on each topic in the upcoming days.

  • Wikipedia edits (Footnote 4) [46]: This public data set encompasses one month of edits made to Wikipedia pages, comprising the 1000 most frequently edited pages. It involves 8227 editors and a total of 157,474 edit records. Like the Reddit data set, the edit text is transformed into feature vectors. The data set is separated into several graph snapshots, each separated by a 2-day interval. The objective of the popularity trend prediction task within this data set is to predict the frequency of edits on each page in the forthcoming days.

Fig. 4

The box plot of the target values (true labels) in the YouTube live streaming data set. The target values are distributed from 1 to more than 10 million. “Month” and “Week” denote the interval of graph snapshots (1-month interval and 1-week interval, respectively)

Fig. 5

The box plot of the percentage of change of the target values (true labels) in the YouTube live streaming data set. The percentages are distributed from \(-1\) to 2 after eliminating outliers. “Month” and “Week” denote the interval of graph snapshots (1-month interval and 1-week interval, respectively)

This work aims to predict values that reflect the popularity degree of target entities. These values correspond to the amount of income (YouTube live streaming data set), the number of visits (MOOC and Reddit data sets), and the number of edits (Wikipedia data set). However, these values often vary across different magnitudes, ranging from 1 to more than 10 million in some cases. For example, the distribution of target values in the YouTube live streaming data set is illustrated in Fig. 4. Predicting target values across such a large and imbalanced distribution of magnitudes poses a challenge for the proposed model, potentially resulting in significant biases in the prediction results.

To address this challenge, we transform the prediction targets, converting them into percentages of change. Thus, the proposed model predicts the percentage of change in the popularity degree values rather than the original values. As depicted in Fig. 5, the percentages only range from \(-1\) to 2 after eliminating outliers. Consequently, the proposed model can provide more accurate predictions within this smaller range. This approach mitigates the negative impact stemming from the vast magnitudes of target values, which are influenced by confounding effects.
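
Since the exact transformation is not spelled out above, the sketch below shows one standard way to compute such percentage-of-change targets between adjacent snapshots; the epsilon guard against division by zero and the function name are assumptions.

```python
import numpy as np

def to_percent_change(values_prev, values_next, eps=1e-9):
    """Relative change of a popularity value between two adjacent snapshots."""
    values_prev = np.asarray(values_prev, dtype=float)
    values_next = np.asarray(values_next, dtype=float)
    return (values_next - values_prev) / (values_prev + eps)

# Example: donation income of three channels in two consecutive monthly snapshots
prev = np.array([100.0, 2000.0, 1.0])
nxt = np.array([150.0, 1000.0, 3.0])
print(to_percent_change(prev, nxt))   # approximately [0.5, -0.5, 2.0]
```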

Setup of the experiment

The data sets introduced above provide the intra graphs of item entities, bipartite graphs representing user-item relationships, and feature vectors for the input. In cases where a data set solely comprises bipartite graphs of user-item pairs, the intra graph is constructed based on the two-hop neighborhood of item entities. That is, two item entities are connected in the intra graph if they share connections with the same user entity in the bipartite graph.
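
A minimal NumPy sketch of this two-hop construction, where a hypothetical user-item interaction matrix is projected onto the item side:

```python
import numpy as np

# Hypothetical user-item interaction matrix B: B[u, i] = 1 if user u interacted with item i
B = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])

# Two-hop projection: two items are connected if they share at least one user
A_intra = (B.T @ B > 0).astype(int)
np.fill_diagonal(A_intra, 0)   # drop self-loops
print(A_intra)
# [[0 1 0]
#  [1 0 1]
#  [0 1 0]]
```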

Experiments are executed on a platform with Intel Xeon Platinum 8360Y CPU and NVIDIA A100 for NVLink 40GiB HBM2 GPU. The standard model hyperparameters are fine-tuned, including the interval of graph snapshots, learning rate, epochs, and dropout rate. Regarding the data set configuration, we test different choices of the interval of graph snapshots for each data set and finally determine the choice with the best performance: a 1-month interval for the YouTube live streaming data set and a 2-day interval for the MOOC, Reddit, and Wikipedia data sets. During training, the Adam optimizer is employed with a learning rate set to 0.01. The model is trained for 30 epochs, including training, validation, and testing phases. A dropout rate of 0.2 is applied. The multi-head attention layer configures two attention heads. Furthermore, an early stopping strategy is implemented, which halts training if the validation loss fails to decrease for five consecutive iterations. Additionally, sensitivity experiments are performed for specific hyperparameters through grid search, as explained later.

All the configurations above are determined by manual grid search. Regarding their sensitivity, the interval of graph snapshots strongly affects model performance because graph snapshots built with different intervals contain substantially different information. In comparison, the experimental results are less sensitive to the learning rate and the dropout rate. Finally, as introduced above, the number of attention heads in the multi-head attention layer significantly impacts the computational cost.

Baselines

Table 2 Input information required by the baseline models

The following baseline methods are compared in the experiments, covering basic linear regression approaches, time-sequence models, previous popularity trend prediction methods, and recently proposed heterogeneous GNN methods:

  1. Gradient Boost Regressor (GBR): A gradient boost regressor from the scikit-learn toolkit.

  2. XGBoost: An ensemble gradient boosting decision tree model from the XGBoost library.

  3. Multi-layer Perceptron (MLP): A basic two-layer fully connected neural network.

  4. LSTM-FCN (Footnote 5) [47]: A time-sequence deep learning model combining LSTM and a fully convolutional network.

  5. HetGNN (Footnote 6) [18]: A GNN model for learning node embedding representations in heterogeneous graphs.

  6. HAN (Footnote 7) [17]: A heterogeneous graph attention network based on hierarchical node-level and semantic-level attention.

  7. HGT (Footnote 8) [48]: A heterogeneous graph transformer architecture that can deal with large-scale heterogeneous and temporal graphs.

  8. CoupledGNN (Footnote 9) [7]: A model that solves the network-aware popularity prediction problem, capturing the cascading effect explicitly by two coupled GNNs.

Table 2 presents the required input information for each baseline method. The default parameter settings are maintained during training and testing for all baseline methods. The specific experimental configurations are as follows:

  • Linear regression algorithms (GBR and XGBoost) and the Neural Network (MLP) receive every feature vector and learn to predict the target value.

  • Time-sequence models (LSTM–FCN) handle the time-sequence data, taking in sequences of feature vectors as input. These models generate a sequence of hidden state embeddings, which are subsequently used for predicting the target value.

  • Static heterogeneous graph methods (HetGNN and HAN) are typically applied to model graph-structured data containing multiple types of entities and relations. In the case of this work, the user-item relations in the data sets can be viewed as a unique heterogeneous graph. These models receive feature vectors and the graph structure information as input.

  • Temporal heterogeneous graph methods (HGT and CoupledGNN) operate on dynamic heterogeneous graphs encompassing node attributes, graph structure, and temporal information. These models are considered the strongest baseline methods compared to the proposed method.

Evaluation

Table 3 The results of node attribute value prediction tasks on four data sets

Table 3 and Fig. 6 present the experimental results for the popularity trend prediction tasks. The performance evaluation is based on the Root Mean Squared Error (RMSE) score and the Mean Absolute Percentage Error (MAPE):

$$\begin{aligned} {\text {RMSE}}(y, \hat{y})&= \sqrt{\frac{\sum ^{N}_{i=1}{(y_i - \hat{y}_i)^2}}{N}} \end{aligned}$$
(23)
$$\begin{aligned} {\text {MAPE}}(y, \hat{y})&= \frac{1}{N}\sum ^{N}_{i=1}\frac{\vert y_i - \hat{y}_i\vert }{\vert y_i\vert }, \end{aligned}$$
(24)

which quantify the average difference between the predicted values \(\hat{y}\) and the actual values y over N samples; a minimal implementation of both metrics is sketched after the list below. Smaller RMSE and MAPE indicate superior performance. The proposed method has the best overall performance, achieving the lowest RMSE and MAPE scores on all four data sets, as indicated by bold font in the table. The respective performance of the baselines is evaluated as follows:

  • Linear regression algorithms (GBR and XGBoost) and the Neural Network (MLP) rely solely on the attribute vectors of items, which is often impractical for predicting trends.

  • Time-sequence models (LSTM–FCN) take in sequences of attribute vectors and generate corresponding sequences of hidden state embeddings for each graph snapshot. However, these models are susceptible to input noise in the early steps, which can significantly impact subsequent outputs. Consequently, relying solely on long sequences of attribute vectors may not perform as well. In some cases, they may even underperform compared to linear regression algorithms that only use attribute vectors.

  • Static heterogeneous GNN methods (HetGNN and HAN) are good at analyzing graph structural information and edge interactions among different types of nodes. These methods outperform linear regression methods and time-sequence models. However, they are not designed to handle temporal data, a crucial aspect of popularity trend prediction.

  • Temporal heterogeneous GNN methods (HGT and CoupledGNN) receive inputs similar to the proposed method. Their performance depends on their underlying architectures and ability to effectively capture temporal and graph structure information. The proposed method outperforms HGT and CoupledGNN, owing to distinctive architectural choices such as incorporating GSL and multi-head attention layer.
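
As referenced above, both evaluation metrics can be implemented in a few lines; the following is a minimal NumPy sketch of Eqs. (23) and (24).

```python
import numpy as np

def rmse(y, y_hat):
    """Root Mean Squared Error, Eq. (23)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat):
    """Mean Absolute Percentage Error, Eq. (24)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y - y_hat) / np.abs(y))

y_true, y_pred = [0.5, -0.2, 1.0], [0.4, -0.1, 1.2]
print(rmse(y_true, y_pred), mape(y_true, y_pred))
```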

Fig. 6

The box plot of the results. Four rows denote four data sets and two columns denote two metrics

Fig. 7

Study of parameter sensitivity

Parameter sensitivity

In this section, an experiment is conducted to determine the optimal values for significant hyperparameters, namely the hyperparameter \(\alpha \) in GSL, the balance hyperparameter \(\lambda \) in the loss function, and the dimension of hidden embeddings. These hyperparameters are fine-tuned through grid search within specific ranges.

In Fig. 7, the performance on four data sets is demonstrated concerning each hyperparameter, using the RMSE score as the metric. The rows represent the results for four data sets, while the columns correspond to values of \(\alpha \) ranging from 0.01 to 0.9, \(\lambda \) ranging from 0.01 to 1.0, and hidden dimensions varying from 16 to 256.

Based on the results, the proposed method is not highly parameter-sensitive. The relatively optimal hyperparameter choices are \(\alpha =0.1\), \(\lambda =0.1\), and a hidden dimension of 128. These settings are employed in the experimental results for the popularity trend prediction tasks in the previous evaluation section.

Ablation study

An ablation study is conducted to assess the significance of key components in the proposed method, namely GSL, intra graphs, bipartite graphs, and the time-sequence unit. It is interesting to discover which component contributes the most in different application cases. Each of these components is responsible for specific types of input information:

  • GSL enhances the quality of input graphs.

  • Intra graphs represent intra-relations within target entities.

  • Bipartite graphs depict relations between target entities and other types of entities.

  • The time-sequence unit receives and propagates temporal information between adjacent graph snapshots.

Table 4 The ablation study results without specific key components in the proposed method

In Table 4, the performance is demonstrated when each component is omitted, with the RMSE score as the metric. The results without specific components consistently underperform compared to the complete method, highlighting the indispensability of all components to the overall performance of the proposed method. The worst result on each data set is marked in bold font, indicating that the corresponding component is the most significant for that data set. To provide specific insights:

  • In the YouTube data set, the method without GSL exhibits the worst performance, highlighting the importance of improving input graph quality.

  • In the MOOC data set, the absence of bipartite graph input leads to the poorest result, emphasizing the significance of item-user relations.

  • In the Reddit data set, all ablated variants achieve similar results, suggesting that no single component stands out as the most crucial.

  • In the Wikipedia data set, the most crucial component is the time-sequence unit, highlighting the valuable information in the edit history.

It is reasonable that the most significant components vary across different data sets, as the most abundant information differs in each case. In summary, those findings emphasize that all components are essential, and their significance depends on the specific application scenarios.

Table 5 The ranks of nine methods assessed on eight tests

Statistical test

This section introduces statistical tests to determine the statistical significance of the observed differences among the methods compared in the “Evaluation” Section. The tests involve formulating a null hypothesis \(H_0\) and an alternative hypothesis \(H_1\) as follows:

Hypothesis 0 (\(H_0\)): All the methods have the same performance.

Hypothesis 1 (\(H_1\)): The performance of the methods differs significantly.

Next, we employ two statistical tests, the Friedman test and the Wilcoxon signed-rank test, to assess the differences between the proposed method and the baseline methods and validate the formulated hypotheses.

Table 6 The results of the Wilcoxon signed-rank test between pairs of the proposed method and each baseline method

Friedman test

The Friedman test, a nonparametric statistical method, is employed to identify significant differences in the performance of two or more methods across multiple test attempts [49, 50]. The Friedman test is the analog of the repeated measures Analysis of Variance (ANOVA) in nonparametric statistical procedures.

The initial step in calculating the Friedman test statistic involves converting the original results in Table 3 into ranks. Specifically, we assess the performance of nine methods (eight baseline methods and one proposed method) on four data sets (YouTube, MOOC, Reddit, and Wikipedia) using two metrics (RMSE score and MAPE score), resulting in a total of eight tests. The results are ranked for each test, ranging from 1 (indicating the best result) to 9 (representing the worst result). Subsequently, the average rank for each method is calculated based on all eight tests. Table 5 illustrates the ranks of the nine methods across the eight tests, with tied ranks assigned an average value. By using these ranks, the Friedman statistic can be computed as follows:

$$\begin{aligned} \chi ^2_f = \frac{12n}{k(k+1)}\left[ \sum _j{R_j^2} - \frac{k(k+1)^2}{4} \right] , \end{aligned}$$
(25)

which follows a \(\chi ^2\) distribution with \(k-1\) degrees of freedom. Here, n represents the number of tests, k is the number of methods included in the comparison, and \(R_j\) denotes the average rank of method j.

Iman and Davenport propose a derivation from the Friedman statistic [51] as follows:

$$\begin{aligned} F_{ID} = \frac{(n-1)\chi ^2_f}{n(k-1)-\chi ^2_f}, \end{aligned}$$
(26)

which follows an F distribution with \(k-1\) and \((k-1)(n-1)\) degrees of freedom.

Based on the formula above, the test statistic \(F_{ID}\) is calculated as 52.534, and the corresponding p value is \(1.943\times 10^{-9}\). As the p value is less than \(\alpha =0.05\), we reject the null hypothesis \(H_0\), which indicates that all baseline methods exhibit the same performance. In other words, there is substantial evidence to support the presence of statistical significance among these methods.
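
The computation can be reproduced with scipy as sketched below. The score matrix here is randomly generated as a stand-in for the real results in Table 3, and scipy's friedmanchisquare applies a tie correction, so its value may differ slightly from Eq. (25) when ties occur.

```python
import numpy as np
from scipy.stats import friedmanchisquare, f as f_dist

# Synthetic stand-in: rows = 8 tests (4 data sets x 2 metrics), columns = 9 methods
rng = np.random.default_rng(0)
scores = rng.random((8, 9))

# Friedman chi-square statistic; scipy ranks the methods within each test internally
chi2_f, p_chi2 = friedmanchisquare(*scores.T)

# Iman-Davenport correction, Eq. (26)
n, k = scores.shape                                  # n tests, k methods
f_id = (n - 1) * chi2_f / (n * (k - 1) - chi2_f)
p_id = f_dist.sf(f_id, k - 1, (k - 1) * (n - 1))     # p value from the F distribution
print(chi2_f, f_id, p_id)
```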

Wilcoxon signed-rank test

While the Friedman test excels at detecting overall differences across multiple comparisons, its limitation lies in its inability to pinpoint significant differences within specific pairs of methods. To address this constraint, the Wilcoxon signed-rank test is employed as a non-parametric alternative to the paired t-test, particularly suited for the non-normally distributed samples in this work.

In the Wilcoxon signed-rank test, the differences between the performance scores of each method pair are calculated across the tests. The test is defined as follows:

$$\begin{aligned} R^+&= \sum _{d_i > 0}{{\text {rank}}(d_i)} + \frac{1}{2} \sum _{d_i = 0}{{\text {rank}}(d_i)} \end{aligned}$$
(27)
$$\begin{aligned} R^-&= \sum _{d_i < 0}{{\text {rank}}(d_i)} + \frac{1}{2} \sum _{d_i = 0}{{\text {rank}}(d_i)}, \end{aligned}$$
(28)

where \(d_i\) represents the difference between the performance scores of two methods on the ith out of n tests. \(R^+\) is the sum of ranks for tests where the first algorithm outperformed the second, and \(R^-\) is the sum of ranks for the opposite. Ranks of \(d_i = 0\), indicating ties, are evenly distributed between the sums. \(T = \min (R^+, R^-)\) is the smaller of the sums. If T is less than the critical value from the Wilcoxon distribution for n degrees of freedom, the null hypothesis \(H_0\) is rejected, signifying that the given method outperforms the other, with associated p values.

In this study, we employ the Python library scikit-posthocs (Footnote 10) to conduct the Wilcoxon signed-rank test and calculate p values for each pairwise comparison between the proposed method and the baseline methods. Similar to the Friedman test, this assessment involves ranking the performance of nine methods across eight tests.

To address the issue of inflated Type I error (familywise error rate) in multiple comparisons, the test incorporates adjusted p values using the Benjamini–Hochberg false discovery rate (FDR) method [52]. The results, including the calculated adjusted p values, are presented in Table 6.

All adjusted p values for the proposed method versus the baseline methods are below the significance level of \(\alpha =0.05\), indicating statistically significant differences from these methods. Notably, all the adjusted p values are identical because the proposed method consistently achieves a rank of 1 across all eight tests, so T always equals 0.
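
The same pairwise procedure can also be reproduced with scipy and statsmodels instead of scikit-posthocs, as sketched below; the score matrix is synthetic and stands in for the real per-test results.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

# Synthetic stand-in: rows = 8 tests, column 0 = proposed method, columns 1-8 = baselines
rng = np.random.default_rng(1)
scores = np.sort(rng.random((8, 9)), axis=1)   # column 0 always holds the best (lowest) score

raw_p = []
for j in range(1, scores.shape[1]):
    # Paired Wilcoxon signed-rank test: proposed method vs. baseline j
    _, p = wilcoxon(scores[:, 0], scores[:, j])
    raw_p.append(p)

# Benjamini-Hochberg FDR adjustment of the eight pairwise p values
_, adj_p, _, _ = multipletests(raw_p, method="fdr_bh")
print(adj_p)
```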

Conclusions and discussion

Discussion

In this section, we discuss the improvements achieved by comparing our method to the most recent and influential studies concerning GNNs in real-world prediction tasks. In contrast to other state-of-the-art heterogeneous GNNs, such as HetGNN [18], HAN [17], HGT [48], and CoupledGNN [7], our method introduces a specialized multi-layer architecture. This architecture decomposes the intricate multiple relationship types into two categories: intra connections within entities of the same type and inter connections across entities of different types. This simplification process proves advantageous for handling heterogeneous graphs. Moreover, our method incorporates several modules to enhance overall performance, including the GSL module for improving the quality of input graphs and a multi-head attention layer for efficiently combining outputs from each graph layer.

In comparison to recent trend prediction studies in other application domains, such as citation networks [8], terrorist attacks [34], and traffic flow [35], these methods often focus primarily on selecting characteristics and statistical features of target entities. They tend to overlook information diffusion along relationships and commonly rely on simplistic machine learning or statistical methods, resulting in suboptimal learning quality. To address these shortcomings, our method emphasizes the “message passing” along relationships within entities and employs advanced deep-learning-based encoders and decoders, leading to high learning efficiency.

In summary, our method addresses real-world trend prediction tasks and demonstrates remarkable improvements compared to the most recent and influential studies.

Conclusion

This work provides an effective solution for learning the representations of entities within social media networks and predicting trends in real-world scenarios. To achieve this, we develop an advanced multi-layer temporal GNN framework that captures information diffusion and temporal dynamics among different types of entities. The experimental results demonstrate the efficiency of the proposed method compared to various baseline methods, including basic linear regression approaches, time-sequence models, previous popularity trend prediction methods, and recently proposed heterogeneous GNN methods. Besides, extensive experiments are conducted to explore the optimal choices of hyperparameters and assess the significance of key components in the proposed method.

In addition, this work makes a substantial contribution to the rapidly growing and promising field of trend prediction in social media. Social media is a primary source for obtaining information about emerging trends worldwide. By predicting trends in social media, the proposed method can assist users in gaining a deeper understanding of the future trajectories of public concerns and interests. Anticipating trends in advance holds significant value, as it enables individuals and organizations to stay ahead of the competition, align their studies and work in the right direction, and seize opportunities for informed decision-making in the future.

Future work

In terms of the limitations of this work, we acknowledge that the proposed method only supports input in the form of sequences of graph snapshots. The proposed method cannot handle other types of temporal graphs, such as sequences of graph actions accompanied by timestamps. Furthermore, real-world situations are highly dynamic and influenced by numerous factors. It is important to note that no one can accurately predict the future without any uncertainties or missing information. Our method is intended to provide results for reference purposes only. Any negative consequences resulting from the misuse of the method are not addressed in this work.