Introduction

Graph representation learning encodes graph data into low-dimensional vectors for downstream tasks such as time series representation, forecasting, and traffic prediction. One popular approach in this field is graph contrastive learning (GCL), which has attracted much attention due to its capabilities in unsupervised learning, robustness, and generalization. To further enhance the performance of GCL models, graph augmentation techniques are commonly used. These techniques introduce noise and provide additional training samples, thereby improving the model’s robustness against noisy or incomplete data. Designing effective augmentation schemes for graphs is challenging, as graphs often represent diverse attributes of the original data [1]. GCA (Graph Contrastive Augmentation) [2] aims to overcome this challenge by incorporating the notion of centrality, which considers the semantic influences between nodes and edges. However, a limitation of centrality measures is that they are task-specific and cannot be learned from data alone, as they rely on prior domain knowledge.

With the advancements in contrastive learning techniques, various contrastive learning-based models have been developed. ST-TSNet [3] introduces stochastic augmentation to capture local and spatial dependencies and effectively explore deep spatio-temporal representations. SSGNN [4] leverages both original and mask-augmented data in the model. It combines self-distillation techniques with self-supervised learning tasks to enhance the capture of spatio-temporal features in graph data. JOAO [5] focuses on optimizing the combination of augmentations without explicitly optimizing the augmentations themselves. By jointly optimizing the augmentation strategies, JOAO improves the performance of contrastive learning models. AutoSSL [6] concentrates on optimizing the weights for combined self-supervised tasks. It automates the process of selecting the most effective tasks for self-supervised learning, thereby enhancing the overall performance of the model.

In this paper, we present JDAGCL (Joint Data Augmentations for Automated Graph Contrastive Learning), a graph contrastive learning framework. Our work makes the following key contributions:

  • We introduce two augmenters that generate appropriate views of the original graph, taking into account both the topology and features of the graph. These augmenters enhance the learning process by providing diverse perspectives and enabling the encoder to capture essential information from different aspects of the graph.

  • We introduce a novel joint training strategy that guarantees semantic consistency and improves the forecasting performance.

  • We conduct comprehensive experiments on four real-world datasets to evaluate the effectiveness of our approach. The experimental outcomes demonstrate that our method consistently outperforms existing techniques, yielding superior forecasting results.

Related work

Graph contrastive learning

Contrastive learning has gained significant attention in various domains in recent years [7, 8]. In image-based contrastive learning, effective contrastive samples can be obtained by selecting augmentation techniques that preserve the intrinsic semantics of the images, leveraging human understanding of image semantics. However, graph data presents greater abstraction and complexity compared to images, making it challenging to ensure that operations on the graph do not disrupt its fundamental semantics [9]. Consequently, selecting appropriate contrastive samples in graph contrastive learning becomes a significant challenge. To address this challenge, existing research has explored different approaches to leverage various aspects of the graph as contrast samples. For example, in the work by [10], the triplet loss is introduced to train a biased encoder that prioritizes easy negative samples, enhancing the learning process by focusing on informative contrastive samples. HSAN (Hardness-based Self-Adaptive Negative Sampling) [11] incorporates both attribute embedding and structure embedding to compute the similarity between samples. This approach provides a more comprehensive representation of the relationships between samples, facilitating the measurement of sample hardness and enabling the selection of more challenging contrastive samples.

Data augmentation on graph

Data augmentation plays an important role in graph contrastive learning as it enhances the diversity of training data by applying various transformations and extensions to graph data. This augmentation process helps the model acquire more robust graph representations and effectively handle variations in input data [9, 12]. Many augmentation techniques have been explored in the context of graph contrastive learning. These techniques include node dropping [13, 14], edge perturbation [15], and subgraph extraction [16,17,18]. Each technique applies specific changes to the original graph data to benefit the subsequent learning procedure. GraphCL [1] extensively investigates various combinations of data augmentation to identify the optimal augmentation strategy for specific downstream tasks. By exploring different augmentation techniques and their combinations, GraphCL aims to maximize the benefits of data augmentation for graph contrastive learning. SSGNN [4] incorporates both original and mask-augmented data into the model. It adopts the self-distillation method and self-supervised learning task to enhance the model. AD-GCL [19] employs an automated edge dropping strategy using data augmentation to enhance the model performance. However, determining the appropriate augmentation rate to ensure semantic lower bounds remains a challenge.

Performance analysis for downstream task

Several existing neural network methods address downstream tasks such as forecasting and classification. MRENN [20] updates the model parameters using the Lyapunov stability criterion and establishes a recursive learning rate mechanism to expedite the learning process. The work in [21] proposes a taxonomy of graph data augmentation techniques and subsequently offers a structured review by categorizing related work according to augmented information modalities. Hybrid-GTS [22] integrates time series and geographic factors to enhance prediction accuracy. It constructs a geometric graph using node locations and leverages a probabilistic graph structure learned from node embeddings to capture nonlinear temporal dynamics. The work in [23] proposes a diagonal recurrent neural network for adaptive control of nonlinear dynamic systems. Gash-LKH [24] proposes a graph-based learning framework that combines a sparse graph neural network with the Lin-Kernighan heuristic solver.

Overall, we note that although significant progress has been made, existing methods still have some limitations, including the reliance of data augmentation methods on domain knowledge and the influence of data incompleteness. These limitations motivate us to propose the JDAGCL framework. By introducing a joint training strategy and automated graph contrastive learning methods, we aim to overcome the limitations of existing methods and provide a more flexible and adaptive solution with superior performance.

Fig. 1 The overall architecture of the proposed JDAGCL model

Methodology

Figure 1 illustrates the overall architecture of the model. We describe the three major components of our method: learnable data augmentation, the joint training strategy, and the encoder.

Problem definition

This work mainly deals with forecasting tasks based on our augmentation and contrastive learning. Formally, an unlabeled attributed graph is defined as \(G=(V,E,A)\), where V is the node set and E is the edge set. \(A\in \mathbb {R}^{N\times N}\) is the adjacency matrix of the graph. The entry \(A_{ij}=1\) if \((v_i,v_j)\in E\), indicating that there is a correlation between nodes \(v_i\) and \(v_j\); otherwise \(A_{ij}=0\). The time series data are given as \(X_{1:T} = \left( X_{1},X_{2},\dots , X_{T} \right) \in \mathbb {R}^{T\times N\times C} \), where C denotes the number of features being considered, N is the number of nodes, and T is the number of time steps. The forecasting problem involves predicting the data at a future time step, denoted as \(X_{T+1} \in \mathbb {R}^{N\times C}\), based on the historical data.

Learnable data augmentation

Our automated augmentation strategy consists of two approaches: node-level augmentation and edge-level augmentation. These two approaches focus on the two core components of the original graph, nodes and edges, and extract appropriate views to further assist the encoder in capturing the semantic commonalities between node views and topological views. The combination of node-level and edge-level augmentation provides a comprehensive approach to data augmentation, jointly capturing broader semantic features. Figure 1 illustrates the two augmentation mechanisms.

Node-level augmentation

The node-level augmentation introduces diversity by perturbing a subset of nodes at a specific time step. Specifically, we randomly select a subset of nodes at a particular time step and apply perturbations to their corresponding embeddings. The selection of nodes to be perturbed is an important aspect that contributes to the diversity introduced in the data. In our mechanism, we consider the aggregation weight as a criterion for selecting the nodes to be perturbed, as it reflects the importance or influence of a node in the overall patterns. In the node-level augmentation, the masking probability \(\rho _{t,i}\) is determined based on the aggregation weight \(\lambda _{t,i}\) using a Bernoulli distribution, that is, \(\rho _{t,i} \sim Bern\left( 1-\lambda _{t,i} \right) \). \(\rho _{t,i}\) indicates the likelihood of masking the feature \(X_{t,i}\) for region i at time step t. However, directly applying a random deletion method based on the masking probability can potentially lead to the unintended masking of crucial data.

To address this issue and prevent the masking of important information, we design a filtering strategy with mean operation. This strategy ensures that high-importance data is not masked while increasing the probability of masking low-importance data. The filtering strategy involves calculating the arithmetic mean of the masking probabilities \(\rho _{t,i}\) for all regions at a specific time step t. Let’s denote this mean as \(\rho _1\), which serves as the masking threshold. The calculation is as follows:

$$\begin{aligned} \rho _1 = \frac{1}{N} \sum _{i=1}^{N} \rho _{t,i} \end{aligned}$$
(1)

Given this threshold, we can determine whether to mask the data for a particular node i at time step t. If \(\rho _{t,i}\) is greater than \(\rho _1\), a random masking method is applied to the data for that node. On the other hand, if \(\rho _{t,i}\) is less than \(\rho _1\), the masking probability for the data of node i is set to 0, ensuring that the data is not masked. We use \(G_{NM}\) to denote the generated view, and the corresponding embedding is denoted as \(E_{NM}\).
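The filtering rule above can be illustrated with a minimal sketch. It assumes that \(\rho _{t,i}\) is read as the Bernoulli parameter \(1-\lambda _{t,i}\), that the aggregation weights lie in [0, 1], and that a node whose probability equals the threshold is left unmasked; the function name `node_mask` and the example weights are hypothetical and not part of the described model.

```python
import random

def node_mask(weights, rng):
    """Return (mask, threshold) for one time step; mask[i] == 1 means node i is masked."""
    # Assumed reading: masking probability = 1 - aggregation weight.
    probs = [1.0 - w for w in weights]
    # Eq. (1): the mean masking probability serves as the threshold rho_1.
    threshold = sum(probs) / len(probs)
    mask = []
    for p in probs:
        if p > threshold:
            # Low-importance node: keep its probability and sample the mask.
            mask.append(1 if rng.random() < p else 0)
        else:
            # High-importance node: probability set to 0, so it is never masked.
            mask.append(0)
    return mask, threshold

weights = [0.9, 0.8, 0.2, 0.1]  # hypothetical aggregation weights for four nodes
mask, rho_1 = node_mask(weights, random.Random(0))
```

In this example the first two nodes have high aggregation weights, so they fall below the threshold and are always kept; only the last two nodes can be masked.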

Edge-level augmentation

The edge-level augmentation considers the fine-grained details of the graph. The process can be divided into two aspects: non-adjacent nodes and adjacent nodes.

Adding edges. Adding edges involves considering the correlation between pairs of non-adjacent nodes in the graph. This correlation is represented by the probability \(\rho _{i,j}\), which is calculated using the Bernoulli distribution: \(\rho _{i,j} \sim Bern(\psi _{i,j})\). A higher value of \(\rho _{i,j}\) indicates a stronger correlation between non-adjacent nodes i and j. To ensure that edges are added between highly correlated nodes while avoiding the addition of edges between nodes with low correlation, a filtering strategy is proposed. This strategy utilizes a threshold, denoted as \(\rho _2\), which is determined by taking the arithmetic mean of \(\rho _{i,j}\) over all non-adjacent node pairs in the graph. The formulation is as follows:

$$\begin{aligned} \rho _2 = \frac{1}{a} \sum _{e_{ij}\notin E} \rho _{i,j} \end{aligned}$$
(2)

where a represents the number of non-adjacent node pairs. Using this threshold, we can decide whether to add an edge between a pair of non-adjacent regions i and j. When \(\rho _{i,j} > \rho _2\), the random addition strategy is employed, and an edge may be added between regions i and j.

This filtering strategy prioritizes the addition of edges between highly correlated non-adjacent regions while preventing the addition of edges between regions with low correlation. We use \(G_{EA}\) to denote the generated view, and the corresponding embedding is denoted as \(E_{EA}\).
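As a rough illustration of the edge-adding rule, the sketch below works on an undirected graph stored as a 0/1 adjacency matrix. It assumes that \(\rho _{i,j}\) is read as the Bernoulli parameter \(\psi _{i,j}\) and that the correlations are given as a symmetric matrix; `add_edges`, `adj`, and `corr` are hypothetical names and values chosen for this example.

```python
import random

def add_edges(adj, corr, rng):
    """Return (new_adj, threshold) after filtered random edge addition."""
    n = len(adj)
    new_adj = [row[:] for row in adj]
    # Candidate pairs: non-adjacent node pairs (i < j).
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j] == 0]
    if not pairs:
        return new_adj, None
    # Eq. (2): mean correlation over the non-adjacent pairs is the threshold rho_2.
    threshold = sum(corr[i][j] for i, j in pairs) / len(pairs)
    for i, j in pairs:
        p = corr[i][j]
        # Only pairs above the threshold are eligible; addition is then sampled.
        if p > threshold and rng.random() < p:
            new_adj[i][j] = new_adj[j][i] = 1
    return new_adj, threshold

adj = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
corr = [[1.0, 0.8, 0.9, 0.1],   # hypothetical correlation values psi_ij
        [0.8, 1.0, 0.2, 0.3],
        [0.9, 0.2, 1.0, 0.7],
        [0.1, 0.3, 0.7, 1.0]]
new_adj, rho_2 = add_edges(adj, corr, random.Random(0))
```

Here only the pair (0, 2) exceeds the threshold of 0.375, so it is the only edge that can be added; existing edges are never removed by this step.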

Deleting edges. For adjacent nodes i and j, the probability of deleting the edge between them can be obtained using the Bernoulli distribution: \(\rho _{i,j} \sim Bern(1-\psi _{i,j})\). Here, \(\psi _{i,j}\) represents the correlation between adjacent nodes i and j. A higher value of \(\rho _{i,j}\) indicates a lower correlation between the adjacent nodes and a higher likelihood of deleting the edge. However, it’s important to note that a low value of \(\rho _{i,j}\) may also result in deleting the edge between nodes i and j, which can have negative effects on the data. To address this issue, a filtering strategy is proposed to increase the probability of deleting edges between low-correlation adjacent nodes while ensuring that edges between highly correlated adjacent nodes are not deleted.

The filtering strategy involves calculating the arithmetic mean of \(\rho _{i,j}\) for all adjacent regions in the region graph. Let’s denote this mean as \(\rho _3\), which serves as the threshold for distinguishing the correlation between adjacent regions. The formulation is as follows:

$$\begin{aligned} \rho _3 = \frac{1}{b} \sum _{e_{ij} \in E} \rho _{i,j} \end{aligned}$$
(3)

where b represents the number of adjacent region pairs, satisfying \(a + b = \frac{N(N-1)}{2}\). Using this threshold \(\rho _3\), we can determine whether to delete an edge between a pair of adjacent regions i and j. When \(\rho _{i,j} > \rho _3\), the probability of deleting \(e_{ij}\) remains unchanged and a random deletion strategy is applied. Otherwise, the probability of deleting \(e_{ij}\) is set to 0, ensuring that the edge is not deleted. We use \(G_{ED}\) to denote the generated view, and the corresponding embedding is denoted as \(E_{ED}\).
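The edge-deletion rule can be sketched in the same spirit. The example assumes that \(\rho _{i,j}\) is read as the Bernoulli parameter \(1-\psi _{i,j}\) and that the graph is undirected; `delete_edges` and the example matrices are hypothetical.

```python
import random

def delete_edges(adj, corr, rng):
    """Return (new_adj, threshold) after filtered random edge deletion."""
    n = len(adj)
    new_adj = [row[:] for row in adj]
    # Existing edges (i < j) with their deletion probability 1 - psi_ij.
    edges = [(i, j, 1.0 - corr[i][j])
             for i in range(n) for j in range(i + 1, n) if adj[i][j] == 1]
    if not edges:
        return new_adj, None
    # Eq. (3): mean deletion probability over the existing edges is rho_3.
    threshold = sum(p for _, _, p in edges) / len(edges)
    for i, j, p in edges:
        # Edges at or below the threshold get probability 0 and are always kept.
        if p > threshold and rng.random() < p:
            new_adj[i][j] = new_adj[j][i] = 0
    return new_adj, threshold

adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
corr = [[1.0, 0.9, 0.5, 0.5],   # hypothetical correlation values psi_ij
        [0.9, 1.0, 0.2, 0.5],
        [0.5, 0.2, 1.0, 0.7],
        [0.5, 0.5, 0.7, 1.0]]
new_adj, rho_3 = delete_edges(adj, corr, random.Random(0))
```

With these values the deletion probabilities are 0.1, 0.8, and 0.3, so only the weakly correlated edge (1, 2) can be removed.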

Joint training strategy

Fig. 2 The illustration of joint training

In this section, we introduce our joint training strategy. Specifically, the whole procedure is conducted in two stages, as shown in Fig. 2. Stage 1 processes from the temporal dimension, and stage 2 processes from the spatial dimension.

Stage 1. We aim to enhance the semantic contextual information representation capability of embeddings. The procedure is divided into three steps. First, the original data and augmented data are fused using element-wise multiplication to form node embeddings. This fusion is performed for each node i at time step t as follows:

$$\begin{aligned} m_{t,i}= e_{t,i} \odot W_{1} + \tilde{e} _{t,i} \odot W_{2} \end{aligned}$$
(4)

where \(e_{t,i}\) represents the node embedding of the original data for node i at time step t, \(\tilde{e}_{t,i}\) represents the embedding of the augmented data for node i at time step t, and \(W_{1}\) and \(W_{2}\) are learnable parameters.

Then, the node embeddings \(m_{t,i}\) obtained in the previous step are aggregated to generate a global representation \(M_{t}\) for time step t. The aggregation is performed by taking the average of the node embeddings across N nodes:

$$\begin{aligned} M_{t} = \frac{1}{N} \sum _{i=1}^{N} m_{t,i} \end{aligned}$$
(5)

Finally, the node embedding \(m_{t,i}\) and the global representation \(M_{t}\) are treated as positive pairs, as they capture the trend of sequence changes at the current time step. The node embedding \(m_{t,i}\) and the embedding \(m_{t',i}\) for other time steps \(t'\) are treated as negative pairs, as they capture the semantic contextual information of data between different time steps. The cross-entropy loss function is used for optimization:

$$\begin{aligned} \mathcal {L} _{t}=-\left( \sum _{i=1}^{N} \log \left( m_{t,i},M_{t} \right) + \sum _{i=1}^{N} \log \left( 1 - \left( m_{t,i},m_{t',i} \right) \right) \right) \end{aligned}$$
(6)

where t and \(t'\) represent two different time steps.
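The three steps of Stage 1 can be sketched as follows. The pair score \(\left( \cdot ,\cdot \right) \) in Eq. (6) is not defined explicitly in the text, so this sketch assumes it is the sigmoid of the dot product of the two embeddings; \(W_{1}\) and \(W_{2}\) are taken to be vectors of the embedding dimension. The helper names and the example values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(e, e_aug, w1, w2):
    """Eq. (4): element-wise fusion of original and augmented embeddings."""
    return [x * a + y * b for x, y, a, b in zip(e, e_aug, w1, w2)]

def temporal_loss(m_t, m_other):
    """Eqs. (5)-(6) for one pair of time steps t and t'."""
    n, d = len(m_t), len(m_t[0])
    # Eq. (5): the global representation is the mean of the N node embeddings.
    global_rep = [sum(m[k] for m in m_t) / n for k in range(d)]
    loss = 0.0
    for m_i, m_neg in zip(m_t, m_other):
        # Positive pair: node embedding vs. global representation (assumed score).
        loss -= math.log(sigmoid(dot(m_i, global_rep)))
        # Negative pair: same node at a different time step.
        loss -= math.log(1.0 - sigmoid(dot(m_i, m_neg)))
    return loss

e_t = [[1.0, 0.0], [0.8, 0.2]]      # hypothetical original embeddings at step t
e_aug = [[0.9, 0.1], [0.7, 0.3]]    # hypothetical augmented embeddings at step t
w1 = w2 = [0.5, 0.5]
m_t = [fuse(e, ea, w1, w2) for e, ea in zip(e_t, e_aug)]
m_far = [[-x for x in m] for m in m_t]   # embeddings at t' that differ from step t
loss_far = temporal_loss(m_t, m_far)
loss_same = temporal_loss(m_t, m_t)      # negatives identical to step t
```

Under the assumed score, the loss is smaller when the embeddings at \(t'\) differ from those at t (`loss_far`) than when they coincide (`loss_same`), which matches the intent of treating other time steps as negatives.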

Stage 2. In this stage, we aim to capture the semantic contextual information of data in the spatial dimension. First, we fuse the spatial semantic contextual information, represented by Q categories, with the augmented node embeddings. Formally, the fusion process is as follows: \(u_{i,q}= r_{q}^{T} \tilde{e} _{i} \), where \(r_{q}\) represents the embedding of the q-th category, and \(u_{i,q}\) represents the strength of relevance. Then, the category score for node i is as follows: \(u_{i}=\left( u_{i,1},u_{i,2},\dots ,u_{i,Q}\right) ^{T}\). Next, we use the category representation of the original node embedding for the self-supervised task: \(v_{i,q}= r_{q}^{T} e _{i} \). Finally, the loss function for the self-supervised task is as follows:

$$\begin{aligned} \mathcal {L}_{s} = -\sum _{i=1}^{N} \sum _{q=1}^{Q} u_{i,q} \log \frac{\exp \left( v_{i,q}/\theta \right) }{ \sum _{q'=1}^{Q} \exp \left( v_{i,q'}/ \theta \right) } \end{aligned}$$
(7)

where \(\theta \) represents the temperature parameter. The overall optimization objective for the semantic contextual self-supervised learning task is as follows:

$$\begin{aligned} \mathcal {L} _{st}=\mathcal {L} _{t}\cdot W_{3} +\mathcal {L} _{s}\cdot W_{4} \end{aligned}$$
(8)

where \(W_{3}\) and \(W_{4}\) represent learnable parameters. Through these two self-supervised tasks, we can effectively assist the model in capturing semantic contextual information in data.
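A minimal sketch of the Stage 2 loss in Eq. (7) and the weighted combination in Eq. (8) is given below. It treats \(W_{3}\) and \(W_{4}\) as scalars and uses a numerically stable log-softmax; the function names and the toy category embeddings are assumptions made for illustration only.

```python
import math

def category_scores(emb, categories):
    """Scores r_q^T e for every category embedding r_q."""
    return [sum(r * x for r, x in zip(r_q, emb)) for r_q in categories]

def log_softmax(scores, theta):
    """Log of the softmax over categories with temperature theta."""
    scaled = [s / theta for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]

def spatial_loss(embs, embs_aug, categories, theta):
    """Eq. (7): u from augmented embeddings, v from original embeddings."""
    loss = 0.0
    for e_i, e_aug_i in zip(embs, embs_aug):
        u_i = category_scores(e_aug_i, categories)
        v_i = category_scores(e_i, categories)
        for u, log_p in zip(u_i, log_softmax(v_i, theta)):
            loss -= u * log_p
    return loss

def joint_loss(loss_t, loss_s, w3, w4):
    """Eq. (8): weighted sum of the temporal and spatial losses."""
    return loss_t * w3 + loss_s * w4

categories = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical Q = 2 category embeddings
loss_s = spatial_loss([[1.0, 0.0]], [[1.0, 0.0]], categories, theta=1.0)
loss_st = joint_loss(2.0, 4.0, 0.5, 0.25)
```

For the single toy node, u = (1, 0) and v = (1, 0), so the loss reduces to the negative log-probability of the first category, \(\log (1 + e^{-1})\).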

Encoder

In this section, we design the encoder to model the data correlations. In particular, we take traffic forecasting as an example of a downstream task. The encoder consists of two parallel channels, each combining gated-TCN and multi-scale spatial-convolutional layers. Both channels in the model adopt a "sandwich" structure. The first and third layers of each channel are gated-TCN layers, responsible for capturing temporal dependencies. The second layer of each channel is a multi-scale spatial-convolutional layer, which focuses on capturing spatial dependencies.

Case 1: Learning Temporal Semantics. To capture temporal correlations, the original traffic tensor and the augmented traffic tensor are separately input into two parallel gated-TCN layers. These layers process the input tensors and produce embedding matrices with temporal awareness. The outputs of the gated-TCN layers are denoted as \(Y_t \in \mathbb {R}^{N \times D}\), representing the embedding matrices at each time step. Specifically, the procedure is formulated as:

$$\begin{aligned} \left( Y_{1} ,Y_{2},\dots ,Y_{T} \right) = GatedTC\left( X_{1} ,X_{2},\dots ,X_{T} \right) \end{aligned}$$
(9)
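The internals of the gated-TCN layer are not spelled out here. A common formulation, assumed in the sketch below, multiplies a tanh-activated filter convolution by a sigmoid-activated gate convolution. The sketch handles a single scalar series for one node, uses a kernel of size 3, and pads on the left with zeros so that T inputs yield T outputs; the padding choice, function names, and kernel values are assumptions.

```python
import math

def causal_conv(series, kernel):
    """1-D convolution with left zero padding, so output length equals input length."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(series)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(series))]

def gated_tcn(series, filter_kernel, gate_kernel):
    """Gated temporal convolution: tanh(filter branch) * sigmoid(gate branch)."""
    f = causal_conv(series, filter_kernel)
    g = causal_conv(series, gate_kernel)
    return [math.tanh(a) * (1.0 / (1.0 + math.exp(-b))) for a, b in zip(f, g)]

x = [0.5, 1.0, 1.5, 1.0, 0.5]            # hypothetical traffic values for one node
y = gated_tcn(x, [0.2, 0.3, 0.5], [0.1, 0.1, 0.8])
```

Because of the left padding, the output at time step t depends only on inputs up to t, and every output lies strictly between -1 and 1.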

Case 2: Learning Spatial Semantics. Different multi-scale convolutional layers, denoted as \(MultiSC_1\) and \(MultiSC_2\), are used in the two channels of the module to capture the spatial correlations in traffic data. These layers take the embedding matrices \(Y_t\) from the gated-TCN layers, along with the adjacency matrix A of the region graph. The procedure is formulated as:

$$\begin{aligned} \begin{matrix}\left( Z_{1}^1 ,Z_{2}^1,\dots ,Z_{T}^1 \right) = MultiSC_1\left( Y_{1} ,Y_{2},\dots ,Y_{T} ,A \right) \\ \left( Z_{1}^2 ,Z_{2}^2,\dots ,Z_{T}^2 \right) = MultiSC_2\left( Y_{1} ,Y_{2},\dots ,Y_{T} ,A \right) \end{matrix} \end{aligned}$$
(10)

where A is the adjacency matrix of the region graph. The multi-scale spatial-convolutional layers with different receptive fields are designed to effectively capture both local and global spatio-temporal features simultaneously. The outputs \(\left( Z_{1}^1, Z_{2}^1,\dots ,Z_{T}^1 \right) \) and \(\left( Z_{1}^2,Z_{2}^2,\dots ,Z_{T}^2 \right) \) are then fed into the gated-TCN layers, respectively. Finally, a dropout operation is performed to obtain the final embedding matrix \(E=\{\mathbf {e_1},\mathbf {e_2},\dots ,\mathbf {e_N}\}\), which is used for the prediction task and the self-supervised learning tasks.

Loss function definition

After the above process, we input all the original node embeddings \(e_{i}\) into FC layers to predict the future trend at time step \(T+1\):

$$\begin{aligned} X_{T+1,i}= FC Layers\left( \textbf{e}_{i} \right) \end{aligned}$$
(11)

where FC layers consist of two fully connected layers. The overall model loss function is as follows:

$$\begin{aligned} \mathcal {L}_{all} = \sum _{i=1}^{N} \left| X_{T+1,i} - \hat{X} _{T+1,i} \right| \cdot W_{5} +\mathcal {L}_{st} \end{aligned}$$
(12)

where \(X_{T+1,i}\) is the predicted result, \(\hat{X} _{T+1,i}\) is the ground truth, and \(W_{5}\) is a learnable parameter.
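Eq. (12) can be illustrated with a short sketch. It assumes a single feature per node (C = 1) and treats \(W_{5}\) as a scalar; `overall_loss` and the example numbers are hypothetical.

```python
def overall_loss(pred, target, loss_st, w5):
    """Eq. (12): weighted absolute forecasting error plus the self-supervised loss."""
    abs_error = sum(abs(p - y) for p, y in zip(pred, target))
    return abs_error * w5 + loss_st

# Two nodes, hypothetical predictions and ground truth for time step T+1.
loss_all = overall_loss(pred=[1.0, 2.0], target=[1.5, 1.0], loss_st=0.25, w5=2.0)
```

In this example the absolute errors sum to 1.5, giving 1.5 × 2.0 + 0.25 = 3.25.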

Experiment

In this section, we conduct comprehensive experiments on four real-world datasets to evaluate the performance of our proposed model.

Experimental settings

Implementation details

All experimental evaluations are performed on a single NVIDIA RTX 3060 GPU. We implemented the proposed network in the PyTorch framework. We set the maximum number of training epochs to 100 and use early stopping. The temporal convolution kernel size of the encoder is set to 3. The multi-scale spatial convolution kernels of the encoder are set to 1, 2, 3 in the first layer and to 3, 4, 5 in the second layer. The Adam optimizer is used during the training phase with a batch size of 32. The embedding dimension D is set to 64. The perturbation ratio for both node-level and edge-level data augmentation is set to 0.1.

Table 1 Description of datasets
Table 2 Comparison of experimental results of different approaches

Evaluation metrics and baselines

To compare these methods quantitatively, we use two metrics to evaluate the performance of models in the field of traffic forecasting: mean absolute error (MAE) and mean absolute percentage error (MAPE).
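For reference, the two metrics can be computed as in the sketch below. Skipping ground-truth values that are (near) zero in MAPE is an assumption made here to avoid division by zero, not a detail stated in the text; the function names and the `eps` parameter are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error in percent, ignoring near-zero targets."""
    terms = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred) if abs(t) > eps]
    if not terms:
        raise ValueError("MAPE is undefined when every target is zero")
    return 100.0 * sum(terms) / len(terms)

truth = [10.0, 20.0, 0.0]       # hypothetical traffic flow values
forecast = [12.0, 16.0, 1.0]
mae_value = mae(truth, forecast)
mape_value = mape(truth, forecast)
```

With these values the absolute errors are 2, 4, and 1, and the two non-zero targets each contribute a 20% relative error.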

We compare JDAGCL with the following baseline models: ARIMA [25], SVR [26], GMAN [27], STFGNN [28], STGCL [29], COST [30], CIGA [31], STNSCM [32], CauST [33], and STSSL [34].

Datasets

In the experiments, we use traffic data to evaluate our proposed model on the traffic forecasting task. The datasets include BJTaxi, NYCBike1, NYCBike2, and NYCTaxi. These datasets were selected based on several considerations that make them suitable for evaluating the effectiveness of our framework. The specific details of the datasets are shown in Table 1.

Main results

In this section, we examine the experimental results of JDAGCL in comparison with the baseline methods. Table 2 displays the detailed results. Overall, the JDAGCL model outperforms the baselines across the four real datasets. Notably, the MAE values of JDAGCL show an average improvement of 14% compared to the other models. The improvement in prediction accuracy can be attributed to learning and optimizing the correlation between the time series in the dataset. On the NYCTaxi dataset, the MAPE of JDAGCL is not the best, although its MAE still outperforms the other baselines; this is possibly due to the limited size of the training samples.

In-depth analysis of joint training strategy

We validate and analyze the effectiveness of the joint training strategy through the following comparative experiments.

  • JDAGCL/SS: removes Stage 2 from the joint training procedure.

  • JDAGCL/ST: removes Stage 1 from the joint training procedure.

  • JDAGCL: Stage 1 and Stage 2 are trained jointly.

Fig. 3 Ablation results of joint training

Figure 3 visualizes the performance of various models on the NYCBike1 and NYCBike2 datasets. It can be observed that among all the components, data augmentation has the most significant impact on predictive performance. In the absence of region-level data augmentation, the MAE for input flow on the NYCBike1 dataset increased from 4.89 to 5.01, and the MAE for output flow increased from 5.18 to 5.31. In the absence of edge-level data augmentation, the MAE for input flow on the NYCBike2 dataset increased from 4.97 to 5.08, and the MAE for output flow increased from 4.63 to 4.75. We conclude that data augmentation significantly improves forecasting results. From the figure, we can also observe that the GCN has the second-largest impact on the model’s prediction results. These results suggest that the GCN at different scales helps the model better capture spatial features. In particular, both stages of the joint training contribute to improving prediction accuracy.

Furthermore, we randomly selected a subset of experimental data from the NYCBike1 and NYCBike2 datasets for visualizing the prediction errors. Figure 4 illustrates the results of the actual and predicted traffic flow values over 100 time steps. It can be observed that our model accurately predicts the sudden increase in traffic flow features, demonstrating that JDAGCL accurately captures the trends in traffic features.

Fig. 4 Visualization of the forecasting results

The effectiveness of augmentation

To further analyze the effect of our proposed augmentation strategy, we conduct an ablation study with the following three variants.

  • JDAGCL-ED: Only using the edge-level augmenter.

  • JDAGCL-FM: Only using the node-level augmenter.

  • JDAGCL-Both: Using both augmenters.

Figure 5 displays the performance of these variants on two datasets. It is evident that combining the two augmenters enhances performance, as a single augmenter limits the GNN encoder’s capacity to learn graph semantic commonality from various perspectives.

Analysis of sampling strategy and parameter sensitivity

In this section, we conduct experiments to analyze the influence of the sampling percentage on model performance; Fig. 6 presents the results. The MAE achieves its best value when the data augmentation ratio is set to 0.1. Given that MAE is the primary metric, we set the data augmentation ratio to 0.1 for the entire experimental process. Between 0.1 and 0.7, both MAE and MAPE show an increasing trend, while they decrease once the ratio exceeds 0.7. These findings highlight the significant impact of data augmentation on improving the model’s prediction performance and capturing spatio-temporal features.

Fig. 5 Ablation experiments of different augmentation approaches

Fig. 6 Sensitivity analysis of sampling strategy

Figure 7 illustrates the variation of the loss function over epochs for our model on the NYCBike1 and NYCBike2 datasets. The training loss exhibits a similar decreasing trend across different datasets, ultimately stabilizing, indicating the stability of our model.

Conclusion

In this paper, we present JDAGCL, a novel framework for graph contrastive learning that enhances learning and forecasting performance through joint training. Our approach incorporates a carefully designed data augmentation strategy to simultaneously improve data quality and prediction accuracy. We introduce a two-stage training task that helps the model learn more robust features. Through comprehensive evaluations on four real-world datasets, we demonstrate that our proposed method outperforms state-of-the-art baselines. The results validate the efficacy of our framework in enhancing learning and forecasting performance for graph-based data.

Fig. 7 Training process of the model

While our proposed method demonstrates superior performance compared to state-of-the-art baselines on the evaluated datasets, it is important to acknowledge some limitations of our work. These limitations include: (1) Scalability: Although our framework performs well on the evaluated datasets, its scalability to larger and more complex graph structures remains an open question. (2) Hyperparameter Sensitivity: The performance of our framework is dependent on the selection of hyperparameters. The sensitivity of the model’s performance to these hyperparameters necessitates careful tuning and validation, which requires significant computational resources. (3) Applicability Assumptions: Our proposed framework assumes certain characteristics and properties of the input graph data, such as the availability of attribute information or the presence of meaningful graph structures. While these assumptions hold for many real-world scenarios, they may not be universally applicable to all types of graph data.

In our future research, we intend to extend the application of our proposed approach to different domains and datasets, such as energy and power. By exploring diverse datasets, we aim to evaluate the universality and generalizability of our model. Additionally, we plan to investigate the influence of external factors on the performance of our model, further enhancing its robustness and applicability.