Introduction

Sentiment analysis is a prominent research area in natural language processing, involving the automatic extraction of emotional tendencies from text and the determination of people’s emotional attitudes towards various trending topics [1]. Text sentiment analysis plays a vital role in various fields, including government management, movie recommendations, public opinion analysis, and assisting researchers in making more informed strategic decisions [2].

In recent years, graph neural network (GNN) models have garnered significant attention for their effectiveness in modeling graph structures, particularly in applications like text sentiment analysis. Yao et al. introduced GCN (graph convolutional network) for heterogeneous text graphs, achieving promising results [3]. Niu et al. proposed a syntax-enhanced graph neural network model for sentiment analysis, further enhancing model performance [4]. However, GNN-based models still face challenges when dealing with text sentiment graphs [5]. Existing GNN models heavily rely on edge connections between graph nodes for representation learning. When constructing text sentiment graphs, the sheer volume of edges created results in high memory consumption. Additionally, due to the presence of edges, as GNN models undergo continuous training, the node representations obtained by such deep models tend to become overly smoothed and indistinguishable [6]. GNN models have demonstrated their ability to effectively preserve the global information of graph structures. However, there are challenges in handling local information among nodes. Therefore, when dealing with sentiment graphs in text processing, preserving the heterogeneity of text, improving model efficiency for effective scalability to large text sentiment corpora, and effectively retaining both global and local information within the graph structure have become significant challenges.

We distinguish ourselves from existing GNN models by introducing a novel Bert-based unlinked graph embedding (BUGE) model for text sentiment analysis. We utilize a small batch of linkless subgraph decomposition for input text sentiment graphs, breaking down large heterogeneous text sentiment graphs into several subgraphs without edge connections. This effectively reduces model memory usage and extends its applicability to large text corpora. By employing specific sampling strategies during the sampling process, we can efficiently preserve both global and local information within the graph structure, enabling nodes to receive more feature information. In the representation learning process, BUGE relies solely on attention mechanisms [7], without employing graph convolutions or aggregation operators, thereby addressing the issue of node oversmoothing and enhancing model training efficiency.

In this paper, our focus is on simplifying the edge connections between nodes in existing GNN models while retaining the heterogeneity of text and effectively preserving both global and local information within the graph structure. We aim to address the issue of node oversmoothing during model training. To achieve this, we employ specific sampling strategies on large-scale text sentiment heterogeneous graphs, creating unlinked small-batch isomorphic subgraphs. This approach eliminates the dependence on edges between graph nodes and can be effectively scaled to large text sentiment corpora. We demonstrate the effectiveness of this sampling method in current research on text sentiment analysis. Our primary contributions are as follows:

  1. We introduce the BUGE model for sentiment analysis. We process the input text sentiment graph using small-batch unlinked isomorphic subgraph decomposition. This approach retains the heterogeneity of text while reducing the model’s storage and computational requirements, making it effectively scalable to large sentiment corpora.

  2. We employ a specific subgraph sampling strategy that preserves local information while retaining the global information of the graph structure. During the representation learning process, the model relies entirely on attention mechanisms, thus addressing the issue of node oversmoothing and enhancing model efficiency.

  3. We conduct experiments on several benchmark datasets, and the results validate the effectiveness of our model.

Related work

Graph-based sentiment analysis

Recently, GNNs have achieved good results in sentiment analysis [8], attracting the attention of many researchers. In this setting, the text is modeled as a sentiment graph for feature extraction, and a GNN is then used for graph embedding. During graph embedding, the GNN can maintain the global structural information of the sentiment graph and effectively deal with the complex relational structure between sentiment words in the text. Therefore, GNNs are widely used in sentiment analysis tasks. Yao et al. proposed the Text-GCN model for text classification, which constructs a large heterogeneous text graph to describe the word–document and word–word relationships and then uses a GCN to complete text classification [3]. Huang et al. proposed TLGNN, a model that constructs a graph for each input text with global parameter sharing, alleviating the dependency between an individual text and the entire corpus [9]. However, these models require pre-constructed graph structures, which limits their practical application. Therefore, Ding et al. proposed HyperGAT, which effectively captures complex node associations and hyperedge relationships through its graph attention mechanism and multi-layer structure. However, it is constrained by attention distribution when handling local information in the graph structure [10]. Zhu et al. proposed SSGC, which adds a self-loop to the Markov diffusion kernel to obtain a simple spectral graph convolution. This convolution strikes a balance between low-pass and high-pass filter bands, effectively capturing both global and local information of nodes [11]. Zhang et al. proposed TextING, which incorporates a gating mechanism to alleviate excessive node smoothing. It is an inductive and versatile text classification model that can handle diverse text sentiment analysis tasks. However, it relies heavily on the quality of the text graph [12]. Zhu et al. proposed GL-GCN, a graph convolutional network guided by both global and local dependencies. It employs two GCNs to learn different dependency structures effectively, capturing both global and local contextual information [13]. Yang et al. proposed the CGA2TC model, which employs a contrastive learning approach with two different views to enhance classification performance. However, CGA2TC’s use of two text views increases computational overhead and does not scale well to graphs built from large text corpora [14].

Bert-based graph representation

In recent years, the powerful BERT model has made significant strides in natural language processing [15], prompting researchers to apply it to graph representation learning by combining BERT with GCN. Lu et al. proposed the VGCN-BERT model, which combines the pre-trained BERT model with a lexical graph convolutional network to construct a large heterogeneous text graph. The attention mechanism then lets local and global information interact and influence each other to jointly construct classification representations [16]. Yang et al. proposed a BERT-enhanced text graph neural network (BEGNN) that considers both the semantic and structural information of a single text. It constructs a graph structure for each text and combines the graph neural network with BERT to extract features of varying granularity [17]. Lin et al. introduced a BERT-GCN model for text classification, which uses a graph to model the relationships between samples from the entire corpus, leveraging the similarity between labeled and unlabeled documents, and employs GNNs to learn these relationships [18]. Hao et al. introduced a defect prediction framework named EDP-BGCNN, which harnesses BERT and GCN for code representation and analysis, leading to more accurate defect prediction [19]. Numerous studies have demonstrated the effectiveness of Bert-based graph representation in related domains.

Compared to existing research, our work differs in several aspects. When dealing with text sentiment graphs, existing models are often inefficient because of the complexity of the constructed graphs. We therefore employ small-batch linkless subgraph decomposition to partition a large heterogeneous text graph into multiple small-batch linkless subgraphs. This allows the model to process the graph without relying on edge connections, significantly enhancing its efficiency on large text sentiment corpora. Throughout this process, we use specific sampling strategies to preserve both the global and local information within the graph structure, enabling nodes to receive more feature information. Furthermore, our model relies solely on the attention mechanism, without any graph convolutions or aggregation operators, which addresses the node oversmoothing issue present in existing GNN models.

Problem definition

Definition 1 Text sentiment network

We construct a text sentiment heterogeneous graph \(\mathcal {G}=(\mathcal {W},\mathcal {S},\mathcal {X},\mathcal {Y})\), where \( \mathcal {W} \) and \( \mathcal {S} \) respectively represent word nodes and sentiment nodes, \(\mathcal {X}\) represents edges between word nodes, and \(\mathcal {Y}\) represents edges between word nodes and sentiment nodes.

The text sentiment heterogeneous graph links words and sentiment words in the text together, represented as a graph.

Definition 2 Text graph embedding

Given an input graph \(\mathcal {G}\), we define U as the set of all nodes in the text sentiment graph. The task of text graph embedding is to learn a mapping function \(f: u_i \rightarrow \textbf{x}_i \in \mathbb {R}^{d}\) that embeds each node \(u_i \in U\) of the text graph network into a low-dimensional latent representation, yielding the embedding matrix \(X \in \mathbb {R}^{|U| \times d}\), where \(d \leqslant |U|\). These embeddings capture structural and sentiment information between nodes.

The obtained node embedding vectors can be used as feature inputs and embedded to complete the task of predicting sentiment relationships.

Definition 3 Node neighbourhood of linkless subgraph

We calculate the intimacy matrix \(\Gamma \) between nodes. For each target node \(u_i\in {U}\), we define its learning context as set \(\zeta _{u_{i}}=\left\{ {u_{j}}|{u_{j}}\in U{\setminus }\left\{ {u_{i}}\right\} \wedge {\Gamma (i,j)} \ge \theta _{i}\right\} \), \(\Gamma (i,j)\) measures the closeness score between word node \(u_i\) and word node \(u_j\), and \(\theta _i\) defines the minimum intimacy score threshold for nodes involved in \(u_i\)’s context.

To find the k nearest neighbor word nodes \(u_j \in U\) with the highest intimacy to node \(u_i\) according to \(\Gamma \), we use \(\zeta _{u_i}\) to select the top-k intimate nodes of \(u_i\) in the graph \(\mathcal {G}\). By combining the context \(\zeta _{u_i}\) of the word node \(u_i\) with the node \(u_i\) itself, we form a linkless subgraph \(g_i\). The complete heterogeneous text sentiment graph can then be expressed as \(\mathcal {G} = \{g_1, g_2, \ldots , g_{|U|}\}\).

Through this definition, we can determine the composition of neighbor nodes in the linkless subgraph, including nodes that are close to this node in the original large heterogeneous graph and those that are farther away.

Definition 4 Sentiment relationship prediction

For the prediction of sentiment relationship, we predict the sentiment contained in the target text based on the constructed text sentiment graph \(\mathcal {G}\). We define the prediction function as \(g:(\mathcal {W},\mathcal {S},\mathcal {X},\mathcal {Y}, v_i )\rightarrow {Z}\) to predict \( v_i\)’s sentiment relationship where \(Z = [Z_1, Z_2, \ldots , Z_i]\) represents the different possible results of the sentiment relationship prediction for \(v_i\).

Fig. 1 The architecture of the BUGE model. (For example, a text is constructed into a text sentiment graph, which is decomposed into several linkless subgraphs. We update the representation of nodes through the graph transformer layer and obtain the text sentiment classification through the readout function)

Proposed method

Overview

This section explains how the model is used for text sentiment classification, as illustrated in Fig. 1. First, we describe the construction of a heterogeneous text graph for sentiment analysis. We then describe the sampling method used to generate multiple linkless subgraphs from the large heterogeneous text graph. Next, we discuss how node inputs are embedded in these subgraphs, followed by how node representations are learned through the graph transformer encoder for classification.

Text sentiment graph

We aim to create a comprehensive and diverse text graph that includes both word and sentiment nodes to facilitate sentiment analysis, with a focus on capturing global word co-occurrences. To achieve this, we construct a text sentiment heterogeneous graph denoted as \(\mathcal {G}=(\mathcal {W},\mathcal {S},\mathcal {X},\mathcal {Y})\), following the approach of Text-GCN. We establish edges between nodes to form a large and complex graph for the entire corpus. The weight of the edge between a sentiment node and a word node is determined by the word’s term frequency-inverse document frequency (TF-IDF) in the document. Additionally, to leverage global word co-occurrence information, we collect word co-occurrence statistics by applying a fixed-size sliding window over all documents in the corpus. We calculate the weights of edges between two word nodes using point-wise mutual information (PMI). Specifically, the weight \(A_{ij}\) of the edge between node i and node j is defined as:

$$\begin{aligned} A_{ij} = \left\{ \begin{array}{ll} \text {PMI}(i,j) &{} i,j \text { are words}, \ \text {PMI}(i,j)>0\\ \text {TF-IDF}(i,j) &{} i \text { is a sentiment word}, \ j \text { is a word}\\ 1 &{} i=j\\ 0 &{} \text {otherwise} \end{array} \right. \end{aligned}$$
(1)

The formula for PMI is:

$$\begin{aligned} \text {PMI}\left( {w}_{i},{w}_{j}\right) =\log \frac{p_{i,j}}{p_{i}p_{j}}=\log \frac{N_{i,j}N}{N_{i}N_{j}} \end{aligned}$$
(2)

where \(N_i\) and \(N_j\) are the number of sliding windows in the corpus that contain the word \(w_i\) and \(w_j\) respectively, and \(N_{i, j}\) is the number of sliding windows that contain both words \(w_i\) and \(w_j\). N is the total number of sliding windows in the corpus.
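The window counting behind Eq. (2) can be sketched as follows. This is a minimal illustration rather than our exact implementation: it assumes documents are already tokenized, and the helper names `sliding_windows` and `pmi_edge_weights` are hypothetical. Only positive PMI values are kept, matching the first case of Eq. (1).

```python
import math
from collections import Counter
from itertools import combinations


def sliding_windows(tokens, window_size):
    """Return all fixed-size windows over a token list (one window if the text is short)."""
    if len(tokens) <= window_size:
        return [tokens]
    return [tokens[i:i + window_size] for i in range(len(tokens) - window_size + 1)]


def pmi_edge_weights(documents, window_size=20):
    """Compute positive PMI weights for word pairs, following Eq. (2)."""
    single_counts = Counter()   # N_i: number of windows containing word i
    pair_counts = Counter()     # N_{i,j}: number of windows containing both words
    total_windows = 0           # N: total number of windows in the corpus

    for tokens in documents:
        for window in sliding_windows(tokens, window_size):
            total_windows += 1
            unique_words = sorted(set(window))  # count each word once per window
            single_counts.update(unique_words)
            pair_counts.update(combinations(unique_words, 2))

    weights = {}
    for (w_i, w_j), n_ij in pair_counts.items():
        pmi = math.log(n_ij * total_windows / (single_counts[w_i] * single_counts[w_j]))
        if pmi > 0:  # Eq. (1) keeps an edge only when PMI is positive
            weights[(w_i, w_j)] = pmi
    return weights
```

Word pairs are stored once, with the two words in alphabetical order, so the resulting weights describe an undirected word–word edge.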

The formula for TF-IDF is:

$$\begin{aligned} \text {TF-IDF}(w, s) = \text {TF}(w, s) \cdot \text {IDF}(w) \end{aligned}$$
(3)

Here, w represents a word, and s represents a sentiment word. TF(w, s) represents the frequency of word w in a sentiment word s, i.e., the number of times w appears in s. IDF(w) represents the inverse document frequency of word w and can be calculated using the following formula:

$$\begin{aligned} \text {IDF}(w) = \log \frac{N}{1 + n(w)} \end{aligned}$$
(4)

Here, N represents the total number of sentiment words in the corpus, and n(w) represents the number of sentiment words that contain the word w.
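Eqs. (3) and (4) can be sketched in a few lines. The example below is illustrative only: it assumes that each sentiment node s corresponds to one tokenized text, and the function name `tf_idf_weights` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return {(text_index, word): weight} using Eqs. (3) and (4)."""
    num_texts = len(documents)                 # N in Eq. (4)
    text_freq = Counter()                      # n(w): number of texts containing w
    for tokens in documents:
        text_freq.update(set(tokens))

    weights = {}
    for idx, tokens in enumerate(documents):
        term_freq = Counter(tokens)            # TF(w, s): raw count of w in text s
        for word, tf in term_freq.items():
            idf = math.log(num_texts / (1 + text_freq[word]))
            weights[(idx, word)] = tf * idf
    return weights
```

Because of the `1 + n(w)` smoothing in Eq. (4), a word that occurs in every text receives a slightly negative IDF, so such words contribute little to the graph.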

Bert-based graph embeddings learning

We incorporate the Graph-Bert model [20] into text sentiment analysis. It adopts top-k intimacy sampling: the intimacy between each node and all other nodes is calculated, and the k nodes with the largest intimacy values are selected as neighbor nodes. The intimacy matrix \(\Gamma \) between the nodes in the complete graph is calculated using PageRank as follows:

$$\begin{aligned} \Gamma =\alpha \cdot {\left( I-\left( 1-\alpha \right) \cdot \overline{A}\right) }^{-1} \end{aligned}$$
(5)

Here, \(\alpha \) is a factor in the range of [0, 1]. The term \(\overline{A} = AD^{-1}\) denotes the column-normalized adjacency matrix, where A is the adjacency matrix of the input graph, and D is the corresponding diagonal degree matrix.
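As an illustration of Eq. (5), the intimacy matrix can be computed directly with NumPy for a small graph. This is a sketch under stated assumptions: the value \(\alpha = 0.15\) is only an example default, the function name `intimacy_matrix` is hypothetical, and a dense matrix inverse is practical only for small graphs.

```python
import numpy as np


def intimacy_matrix(adjacency, alpha=0.15):
    """PageRank-based intimacy scores: Gamma = alpha * (I - (1 - alpha) * A_bar)^-1."""
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=0)          # column sums of A
    degrees[degrees == 0] = 1.0              # guard against isolated nodes
    a_bar = adjacency / degrees              # A D^-1: every non-empty column sums to 1
    identity = np.eye(adjacency.shape[0])
    return alpha * np.linalg.inv(identity - (1 - alpha) * a_bar)
```

For any \(\alpha \) in (0, 1], the matrix being inverted is non-singular. When no node is isolated, each column of \(\Gamma \) sums to 1, which gives a quick sanity check on the result.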

According to the calculated intimacy matrix \(\Gamma \), for each sentiment target node \(s_i \in S\), we define its learning context as the set:

$$\begin{aligned} \zeta _{s_{i}} = \left\{ w_{j} \big | w_{j} \in V\backslash \left\{ s_{i} \right\} \wedge \Gamma (i,j) \ge \theta _{i} \right\} \end{aligned}$$
(6)

Here, \(\theta _i\) is the minimum intimacy score threshold for nodes to be involved in \(s_i\)’s context.

For each sentiment target node \(s_i \in S\), we select the k closest neighbor nodes \(w_j \in W\) to sample the context subgraph \(\zeta _{s_i}\) of size k.
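The top-k selection described above can be sketched as follows. It is a minimal illustration, assuming that row i of \(\Gamma \) holds the intimacy scores of target node i; the function name `sample_linkless_subgraph` is hypothetical. The returned node list carries no edges, which is what makes the subgraph linkless.

```python
import numpy as np


def sample_linkless_subgraph(gamma, target, k):
    """Return the target node followed by its k most intimate nodes (no edges are kept)."""
    scores = np.array(gamma[target], dtype=float)    # copy of row Gamma(target, :)
    scores[target] = -np.inf                         # exclude the target itself
    k = min(k, scores.size - 1)                      # cannot sample more nodes than exist
    top_k = np.argsort(-scores, kind="stable")[:k]   # highest intimacy first
    return [target] + top_k.tolist()
```

Choosing k by rank rather than by a fixed threshold \(\theta _i\) keeps every subgraph the same size, which simplifies batching.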

The nodes in the text sentiment analysis graph are updated for classification based on the sampled linkless subgraphs. Node feature vectors are obtained by embedding the model’s input nodes.

Raw Feature Vector Embedding: The raw feature vector embedding captures the sentiment text and word types in the text graph. For each subgraph node \(w_j \in V_i\), we embed its raw feature vector, denoted as \(y_j\), into a feature space.

$$\begin{aligned} {q_j^{(x)} = \text {Embed}(y_j) \in \mathbb {R}^{d_h \times 1}} \end{aligned}$$
(7)

Here, the \(Embed(\cdot )\) function can be implemented using different models; we use BERT.

Relative Positional Embedding: We define the position of a word node \(v_j\) in the subgraph as \(P(v_j)\). By default, \(P(v_i)\) is set to 0 for the target node \(v_i\), and nodes closer to \(v_i\) have smaller position indices. For node \(v_j\), we extract its intimacy-based relative positional embedding using a positional embedding function P-\(Embed(\cdot )\) as follows:

$$\begin{aligned} {q_j^{(r)} = \text {P-Embed}{(P(v_j))} \in \mathbb {R}^{d_h \times 1}} \end{aligned}$$
(8)

Then, we apply the graph transformer method to the graph structure data. After calculating the two embedding vectors defined above, we aggregate them together as input to the encoder.

We organize all the input vectors in the subgraph \(g_i\) into a matrix \(H^{(0)} = \left[ h_i^{(0)}, h_{i,1}^{(0)}, \ldots , h_{i,k}^{(0)}\right] ^T \in \mathbb {R}^{(k+1) \times d_h}\), and then the representation of the node is gradually updated through the multi-layer attention operation:

$$\begin{aligned} \begin{aligned} H^{(l)}&=G\text {-Transformer} (H^{(l-1)})\\&=\text {softmax}\left( \frac{QK^T}{\sqrt{d_h}}\right) {V}+G\text {-Res}(H^{(l-1)},X_i) \end{aligned} \end{aligned}$$
(9)

where

$$\begin{aligned} \left\{ \begin{array}{ll} Q=H^{(l-1)}W_Q^{(l)}\\ K=H^{(l-1)}W_K^{(l)}\\ V=H^{(l-1)}W_V^{(l)} \end{array} \right. \end{aligned}$$
(10)

In the above equation, each graph transformer layer contains three trainable matrices: \(W_Q^{(l)}, W_K^{(l)}, W_V^{(l)} \in \mathbb {R}^{(d_h \times d_h)}\), and queries Q, keys K, and values V are generated by multiplying the input correspondingly. G-Res [21] refers to a residual network for solving the over-smoothing problem of GNNs.
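One attention update of Eqs. (9) and (10) can be sketched with NumPy. This is a simplified, single-head illustration and not the full model: the G-Res term is replaced by a plain residual connection to the layer input, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def graph_transformer_layer(h_prev, w_q, w_k, w_v, residual=None):
    """One attention update over a linkless subgraph, in the spirit of Eqs. (9)-(10)."""
    d_h = h_prev.shape[1]
    q, k, v = h_prev @ w_q, h_prev @ w_k, h_prev @ w_v      # Eq. (10)
    attention = softmax(q @ k.T / np.sqrt(d_h), axis=-1)    # (k+1) x (k+1) weights
    if residual is None:
        residual = h_prev                                    # simplified stand-in for G-Res
    return attention @ v + residual
```

Since the attention weights are computed between all k+1 nodes of the subgraph, no adjacency matrix is needed at this stage; the graph structure enters only through the sampling step and the positional embedding.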

The representation fusion layer averages the output embeddings of the D-th encoder layer to obtain \(z_i\) as the final representation of the target node \(v_i\):

$$\begin{aligned} z_i=\text {Fusion}{(H^{(D)})} \end{aligned}$$
(11)

Text sentiment node classification

After learning the node representations, we classify the nodes. The fused representation \(z_i\) of the target node \(v_i\), obtained by averaging the output embeddings of the D-th encoder layer, is fed into a softmax classifier:

$$\begin{aligned} z_{i} = \text {softmax}\left( \text {average}\left( H^{(D)} \right) \right) \in \mathbb {R}^{d_{h} \times 1} \end{aligned}$$
(12)

In comparison with the real labels of nodes, we define the cross-entropy loss function as follows:

$$\begin{aligned} \mathcal {L} = -{\sum _{n\in \mathcal {T}}{\sum _{f=1}^{d_{y}}{y_{n}(f)\log {z_{n}(f)}}}} \end{aligned}$$
(13)

Here, \(n\in \mathcal {T}\) denotes the target word/sentiment word node in the training set, \(d_y\) is the label vector dimension, and \(y_n\) denotes the ground truth label vector of node n.
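The readout and loss in Eqs. (12) and (13) can be sketched as follows. This is an illustration under assumptions: the linear projection `w_out` from the hidden size to the number of classes is hypothetical and stands in for the fully connected layer, and labels are assumed to be one-hot vectors.

```python
import numpy as np


def readout(h_final, w_out):
    """Average node embeddings, project to class scores, and apply softmax."""
    pooled = h_final.mean(axis=0)              # average over the k+1 subgraph nodes
    logits = pooled @ w_out                    # hypothetical projection to d_y classes
    exp = np.exp(logits - logits.max())        # subtract the max for numerical stability
    return exp / exp.sum()


def cross_entropy(probabilities, one_hot_labels, eps=1e-12):
    """Eq. (13): cross-entropy summed over the training nodes."""
    probabilities = np.clip(probabilities, eps, 1.0)   # avoid log(0)
    return float(-(one_hot_labels * np.log(probabilities)).sum())
```

For a single node with predicted probabilities (0.5, 0.5) and true label (1, 0), the loss is \(\log 2\), and it approaches 0 as the predicted probability of the true class approaches 1.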

Through the joint training of the fully connected layer constructed above and the model, we can determine the label type of the node.

Experimental results and evaluation

In this section, we will evaluate the performance of the proposed model in text sentiment analysis. We will compare our proposed model with classic text sentiment analysis baseline methods, demonstrate the superiority of our model in text sentiment analysis, and document and analyze the experimental results.

Datasets

We conducted extensive experiments on three benchmark datasets:

MR (Movie Review): This dataset is used for binary sentiment classification of movie reviews, each consisting of a single sentence [22]. It comprises 10,662 samples, evenly split between 5331 positive and 5331 negative reviews [23].

IMDB (IMDB Movie Reviews): This large text sentiment corpus contains more data than previous benchmarks, consisting of 50,000 reviews from the Internet Movie Database, labeled as positive or negative [24]. It encompasses 25,000 positive and 25,000 negative sample reviews.

SST-5 (The Stanford Sentiment Treebank): Stanford University released this sentiment analysis dataset, which includes five labels: very positive, positive, neutral, negative, and very negative. This dataset provides a clearer distinction between emotions [25]. Detailed statistics for these datasets are listed in Table 1.

Table 1 Summary statistics of datasets

Experimental setup

For our experiments, we initialized the embeddings with pre-trained 300-dimensional GloVe vectors [26]. We trained a two-layer graph transformer with a hidden size of 32 and 6 attention heads. We used Adam as the optimizer [27] with an initial learning rate of 0.001, which was decayed by a factor of \(10^{-5}\). We set the number of training epochs to 50.

Evaluation metrics

In our experiments, we adopted accuracy, precision, recall, and F1-score as performance metrics. Accuracy measures the model’s ability to correctly classify samples, precision measures how many of the samples predicted as positive are actually positive, recall assesses the model’s capability to identify positive class samples, and F1-score provides a combined evaluation by considering both precision and recall. Together, these metrics give a comprehensive assessment of the model’s performance.

Accuracy: In binary classification, accuracy is calculated as the proportion of correctly predicted positive and negative samples to the total number of samples. In multi-class classification, accuracy is calculated by dividing the sum of correctly predicted samples by the total number of samples.

$$\begin{aligned} \text {Acc} = \frac{\text {TP}+\text {TN}}{\text {TP}+\text {TN}+\text {FP}+\text {FN}} \end{aligned}$$
(14)

where TP = true positive, FP = false positive, TN = true negative, and FN = false negative.

Precision: In binary classification, Precision represents the proportion of correctly predicted positive samples to all samples that were predicted as positive. In multi-class classification, Precision is calculated separately for each label, and then a weighted average is computed to account for class imbalances.

$$\begin{aligned} \text {Pr} = \frac{\text {TP}}{\text {TP}+\text {FP}} \end{aligned}$$
(15)

Recall: In binary classification, Recall, also known as True Positive Rate, measures the proportion of positive samples that were correctly predicted out of the total number of actual positive samples in the dataset. In multi-class classification, Recall is calculated for each label by dividing the number of correctly predicted samples in that label by the total number of actual samples in that label. A weighted average is then computed to account for class imbalances across labels.

$$\begin{aligned} \displaystyle \text {Re} = \frac{\text {TP}}{\text {TP}+\text {FN}} \end{aligned}$$
(16)

F1-score: The F1-score is the harmonic mean of Precision and Recall. The F1-score is calculated using the following formula:

$$\begin{aligned} \displaystyle \frac{2}{F_{1}} = \frac{1}{\text {Pr}} + \frac{1}{\text {Re}} \end{aligned}$$
(17)
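For the binary case, Eqs. (14) to (17) can be computed directly from the confusion counts. The sketch below is illustrative, and the function name `binary_metrics` is hypothetical. Eq. (17) is used in its rearranged form \(F_1 = 2 \cdot \text {Pr} \cdot \text {Re} / (\text {Pr} + \text {Re})\).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for one positive class (Eqs. 14-17)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Eq. (17) rearranged: F1 = 2 * Pr * Re / (Pr + Re)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For multi-class data such as SST-5, the same function can be applied once per label, with the per-label scores then combined by a support-weighted average as described above.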

Methods for comparison

We adopt the current popular text sentiment analysis models to conduct a performance comparison evaluation of our proposed model:

Text-GCN [3]: Constructs a heterogeneous graph based on text and words, enabling semi-supervised text classification using Graph Convolutional Networks.

TLGNN [9]: Builds a graph for each input text with global parameter sharing, alleviating the dependency between a single text and the entire corpus.

SSLGNN [28]: Introduces a new sparse structure learning model based on Graph Neural Networks.

HyperGAT [10]: A variant of graph neural networks designed to handle hypergraph data.

TensorGCN [29]: Utilizes a model for multi-angle mapping and fusion.

TextING [12]: Constructs a separate graph for each text and updates nodes through a message propagation mechanism.

SSGC [11]: Adds a self-loop to the Markov diffusion kernel and proposes a simple spectral graph convolution with a softmax classifier after a linear layer.

CGA2TC [14]: Presents a novel framework of contrastive graph convolutional networks with adaptive augmentation.

GFN [30]: Introduces a unified graph fusion network that transforms external knowledge into structural information.

Results

Results analysis

We use multiple metrics to present the experimental results of various sentiment classification methods on the three benchmark datasets. The results are detailed in Tables 2, 3, and 4. Our experiments demonstrate that our proposed model performs well on the benchmark datasets. IMDB is a large sentiment corpus with longer texts and a larger sample size, whereas each item in the MR dataset is a relatively short text, and SST-5 is a multi-class dataset. Graph-based methods consistently perform well across the datasets. Our model shows significant improvements on the IMDB dataset, primarily owing to the considerably longer average text length in IMDB compared to the other datasets. When constructing text sentiment graphs, longer texts establish more sentiment word connections through nodes, which enhances the ability to capture the relationship between target nodes and sentiment words. Combined with BUGE, this leads to superior performance. This may explain why the improvement is more pronounced on large sentiment corpora than on smaller ones such as MR and SST-5, which feature shorter texts: the constraints of the graph structure during text graph construction yield only slight performance improvements for datasets with shorter texts.

Table 2 Comparison of performance on different sentiment analysis approaches on the MR dataset
Table 3 Comparison of performance on different sentiment analysis approaches on the IMDB dataset
Table 4 Comparison of performance on different sentiment analysis approaches on the SST-5 dataset

Text-GCN and TLGNN use graph neural networks to train models, but both struggle to preserve the heterogeneity of text data and face scalability challenges when applied to large-scale text sentiment corpora. TextING ignores word order and may therefore be less effective at extracting text features, especially in sentiment classification tasks. HyperGAT neglects local information in the graph structure, which prevents the model from capturing label correlations and reduces its performance, especially in multi-label tasks. GFN introduces an innovative approach to constructing a text sentiment graph, but like HyperGAT, it fails to adequately capture local information within the graph structure. TensorGCN relies on pre-trained word embeddings and tends to perform well with shorter texts but fares poorly with longer ones. SSGC incorporates double-layer attention mechanisms, which prove advantageous for learning tasks involving lengthy texts. CGA2TC uses a multi-view contrastive method to enhance classification performance. However, multi-view methods typically increase computational complexity because features from multiple views must be handled. This results in longer training times and higher computational resource requirements, causing the model to perform poorly on large text sentiment corpora. In addition, all of the models mentioned above face the issue of node oversmoothing during training. In contrast, by disregarding edge connections in the text graph, the BUGE model addresses the oversmoothing problem and can be efficiently extended to large-scale text sentiment corpora. During the sampling process, we also adequately preserve both the global and local information within the graph structure.

Analysis of efficiency

In this section, we investigate the efficiency of our model to validate its effectiveness. We compare the time and memory usage of different models, and the experimental results are presented in Table 5. We observed that in terms of time and memory usage, our model outperforms other models in comparison. This is because our model samples large heterogeneous graphs into smaller batches of edge-less subgraphs, removing the constraints imposed by edges, resulting in lower memory and time consumption.

Table 5 Model efficiency comparison

Parameter sensitivity

In this section, we investigate the impact of different parameters on experimental results. While test recall and test accuracy serve as similar evaluation metrics for the model, we choose to showcase the model’s performance using test accuracy on the dataset. We present the experimental results under various parameters on the dataset, and Fig. 2 illustrates the effect of different values of k on experimental accuracy. The value of k determines the size of the sampled neighbor nodes when performing subgraph sampling without edges, which can significantly affect the experimental outcomes.

We examine the results with different k values and observe that on the MR dataset, test accuracy gradually increases from \(k=1\) and then decreases after reaching the optimal value of \(k=16\). On the IMDB dataset, the optimal value is \(k=24\), and on the SST-5 dataset, the optimal value is \(k=16\). We have also noticed similar trends on other datasets. As the value of k increases, the efficiency cost of training the model gradually rises. However, in comparison to other GNN models, our model exhibits significantly reduced training costs.

Fig. 2 The influence of k on the experimental accuracy under different values

Additionally, we also took into account the influence of other parameters on the experimental results. Figure 3 displays the accuracy of the MR and IMDB datasets at various window sizes. The graph illustrates that test accuracy initially rises as the window size increases. It reaches its peak when the window size is set to 20. However, once the window size surpasses 20, test accuracy starts to decline. Consequently, we have chosen to set the window size to 20.

Fig. 3 The influence of sliding window sizes on the experimental accuracy under different values

We also considered the impact of vector dimensions on experimental results. We conducted comparative experiments using sentence vectors of varying dimensions on both the MR and IMDB datasets to analyze how the vector dimension affects the outcomes. The results of these experiments are presented in Fig. 4. As the vector dimension increases, the model’s test accuracy rises and reaches its maximum at 150 dimensions. Beyond this point, the model’s accuracy starts to gradually decrease. When the dimension of embeddings is too low, it may fail to effectively preserve the original features of the nodes. Conversely, embeddings with excessively high dimensions can demand more training time. Consequently, we have chosen to set the output dimension of the first layer to 150.

Fig. 4

The influence of dimensions on the experimental accuracy under different values

We also investigated the impact of the number of attention heads on model performance, since multi-head attention is a crucial component of such models. We used test accuracy on the MR and IMDB datasets as the metric. The results in Fig. 5 indicate that our model achieved the highest test accuracy with six attention heads. Notably, test accuracy was lowest when the attention mechanism was not applied, and it gradually decreased as the number of attention heads exceeded six. These findings emphasize the importance of selecting an appropriate number of attention heads, as it can significantly improve the model’s performance.

Fig. 5

The influence of the number of attention heads on the experimental accuracy under different values
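To make the role of the head count concrete, the following is a minimal sketch of multi-head scaled dot-product attention in which a target vector attends over a set of context vectors. It is a simplification and not the authors' implementation: the learned query, key, value, and output projections are omitted, and the names `split_heads` and `multi_head_attention` are illustrative. The sketch also assumes that the embedding dimension must be divisible by the number of heads, as in the standard formulation [7].

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vector, num_heads):
    """Split one embedding into `num_heads` equal, contiguous slices."""
    if len(vector) % num_heads != 0:
        raise ValueError("embedding dimension must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


def multi_head_attention(target, context, num_heads):
    """Attend from `target` over `context` vectors, one head per slice.

    Simplification: no learned projections, so each head works directly
    on its slice of the input embeddings.
    """
    target_heads = split_heads(target, num_heads)
    context_heads = [split_heads(vec, num_heads) for vec in context]
    output = []
    for h in range(num_heads):
        query = target_heads[h]
        scale = math.sqrt(len(query))
        # Scaled dot product between the query slice and each context slice.
        scores = [sum(q * k for q, k in zip(query, ctx[h])) / scale
                  for ctx in context_heads]
        weights = softmax(scores)
        # Weighted sum of the context slices, appended to the concatenated output.
        for j in range(len(query)):
            output.append(sum(w * ctx[h][j]
                              for w, ctx in zip(weights, context_heads)))
    return output


out = multi_head_attention(
    [1.0, 0.0, 0.0, 1.0],
    [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
    num_heads=2,
)
```

If attention were applied to the 150-dimensional vectors chosen above (the section does not state this), six heads would give each head a 25-dimensional slice, whereas a head count such as four would not divide the dimension evenly. More heads also mean smaller slices per head, which may be one reason accuracy eventually declines.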

The learning rate is a crucial parameter in our experiments, and we explored different values to assess their effect on the model’s performance. The experiment began with a learning rate of 0.1. As depicted in Fig. 6, the model’s performance gradually improved as the learning rate was reduced, peaked at a learning rate of 0.001, and declined when the learning rate was reduced further. Therefore, we set the learning rate to 0.001.

Fig. 6

The influence of learning rate on the experimental accuracy under different values

Fig. 7

The influence of epoch numbers on the experimental accuracy under different values

Finally, we investigated the impact of the number of epochs, running experiments with various values to determine the optimal setting for our task. According to Fig. 7, the model’s performance peaks when the number of epochs is set to 50, and further increasing the number of epochs does not significantly improve performance. This may be because an excessive number of epochs can lead to overfitting and reduce the model’s ability to generalize to new data.
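The parameter studies in this section each vary one setting while the others stay fixed. A minimal sketch of such a one-at-a-time sweep is shown below. `BASE_CONFIG` collects the values selected in this section for the MR dataset; `sweep_parameter`, the candidate grid, and `toy_score` are hypothetical. In particular, `toy_score` is a placeholder objective that only makes the sketch runnable: it is not a model, and its values are not results from our experiments.

```python
# Settings selected in this section (k is the MR value).
BASE_CONFIG = {
    "k": 16,
    "window_size": 20,
    "embedding_dim": 150,
    "num_heads": 6,
    "learning_rate": 0.001,
    "epochs": 50,
}


def sweep_parameter(evaluate, base_config, name, candidates):
    """Vary one hyperparameter while holding the others fixed.

    `evaluate` maps a config dict to a score where higher is better
    (for example, test accuracy). Returns (best_value, scores_by_value).
    """
    scores = {}
    for value in candidates:
        # Copy the base config so it is never modified in place.
        config = dict(base_config, **{name: value})
        scores[value] = evaluate(config)
    best_value = max(scores, key=scores.get)
    return best_value, scores


def toy_score(config):
    """Placeholder objective (NOT a real model): peaks at k = 16."""
    return -(config["k"] - 16) ** 2


# Hypothetical candidate grid; a real run would pass a function that
# trains the model with `config` and returns its test accuracy.
best_k, k_scores = sweep_parameter(toy_score, BASE_CONFIG, "k", [1, 8, 16, 24, 32])
```

Because each sweep fixes the remaining settings, this procedure does not capture interactions between parameters, such as between the learning rate and the number of epochs.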

Effects of the amount of labelled data

To evaluate the influence of the proportion of training data on test accuracy, we conducted experiments with several models on the MR dataset using varying proportions of the training data. The comparison results are depicted in Fig. 8. Our BUGE model consistently achieved the best performance across all proportions, reaching a test accuracy of 0.709 with only 10% of the training data. This underscores our model’s capacity to perform well with limited labeled data and to capture and retain textual information effectively.

Fig. 8

The influence of varying training data proportions on the experimental accuracy under different values
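A reduced training set such as the 10% setting can be drawn in several ways. The sketch below shows one option, a class-stratified random subsample, so that each sentiment class keeps roughly its original share. Whether the subsets in this experiment were stratified is not stated here, so this is an assumption, and `subsample_training_set` is an illustrative name.

```python
import random
from collections import defaultdict


def subsample_training_set(labels, proportion, seed=0):
    """Return sorted indices of a class-stratified subset of the training data.

    Assumption: sampling is stratified by label. At least one example
    per class is kept, and the fixed seed makes the subset reproducible.
    """
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        keep = max(1, round(proportion * len(indices)))
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)


# Toy labels for illustration only: 50 negative (0) and 50 positive (1) examples.
labels = [0] * 50 + [1] * 50
subset = subsample_training_set(labels, proportion=0.1)
```

Fixing the seed and reusing the same subset for every model keeps the comparison across models fair at each proportion.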

Conclusion and future work

In this paper, we introduce a novel Bert-based unlinked graph embedding (BUGE) model. Our approach demonstrates significant potential for handling graphs built from large-scale text sentiment corpora. By dividing the corpus graph into multiple unlinked subgraphs, each comprising a target node and its surrounding nodes without direct edge connections between them, our method enables representation learning that relies on attention mechanisms rather than graph connections. This effectively addresses the over-smoothing issue present in existing GNN models while enhancing model efficiency. Experimental results on multiple benchmark datasets have validated the effectiveness of our approach.

In the future, our research will focus on further improving the proposed BUGE model. We plan to integrate knowledge graphs into our graph pre-training model, enhancing interpretability by combining domain-specific knowledge graphs with data. This integration will enable better information retention and extraction when constructing text graphs, thereby further enhancing the model’s performance.