1 Introduction

Graph clustering with attribute features and graph structure information is a hot topic in the study of graph data. It refers to partitioning the nodes of a graph into several groups without the guidance of additional information, such that samples within the same group are highly similar while samples in different groups are relatively dissimilar. Some traditional clustering algorithms can only exploit either node information or structural information of the graph, such as k-means [1], spectral clustering [2], and DeepWalk [3]. Moreover, as data volumes grow, the computational efficiency and performance of traditional clustering methods are relatively low when faced with sparse, high-dimensional, non-Euclidean data.

Clustering tasks on regular data (e.g., in the image domain) have achieved impressive success by combining deep neural network frameworks, represented by the auto-encoder (AE) and its variants [4,5,6], with traditional clustering techniques, such as [7,8,9,10,11,12,13,14]. This has provided insight into graph clustering, and one of the main approaches is to combine graph embedding learning with graph clustering. First, graph embedding is used to reduce the dimensionality of the original data and map it into a lower-dimensional space. Then, graph clustering is carried out on the embedding that captures the discriminative information. The main graph embedding methods include the graph convolutional network (GCN) [15], the graph attention network (GAT) [16], and their combinations with the auto-encoder, i.e., the graph auto-encoder (GAE) [17] and its variants [18,19,20,21,22,23,24]. The training process of the above methods is two-step, i.e., the clustering loss is independent of the optimization of the model, which leads to sub-optimal training performance. EGAE [25] uses a relaxed k-means, which ensures the orthogonality of the obtained embedding while allowing clustering and reconstruction to be optimized simultaneously. However, these approaches focus on the information obtained from the model embedding and ignore the information generated by the model's clustering.

Recently, self-supervised learning has emerged as a new paradigm of representation learning that learns supervised information from the data itself without relying on manual labels. Two main methods can effectively exploit cluster information: self-supervision and pseudo-supervision [26], both of which are unsupervised training methods. The former obtains a higher-confidence auxiliary distribution by designing a pretext task and later uses it to supervise the training of the target distribution, as in DEC [7] and IDEC [27]. The latter guides the downstream model through semi-supervised training with pseudo-labels obtained from clustering, as in DeepCluster [28], IDCEC [29], and DSCNSS [30]. Beyond these two methods, contrastive learning achieves self-supervision by constructing positive and negative sample sets through data augmentation to guide model training. However, there are relatively few studies based on self-supervised learning on graph data compared with other fields. Most of the methods mentioned above determine the confidence level of a sample from the Euclidean distance between its embedding and the k-means cluster centers, and they lack an exploration of sample-reliability mechanisms for other clustering algorithms. Many self-supervised learning methods aim to learn representative features without label information. However, whether a self-supervised mechanism can improve the fusion of topological structure and node features in GCNs remains to be explored.

To make full use of the supervised information from graph clustering and to reduce the damage that noise in this information causes to the accuracy of downstream models, we use pseudo-labels generated by unsupervised learning in the pretext task as supervised information to guide the partitioning of sample nodes. We propose a new self-supervised graph clustering model and elaborate a symmetric graph auto-encoder clustering model for the pretext task. In this model, a self-training module is added to the symmetric graph auto-encoder to optimize clustering and embedding learning simultaneously. Meanwhile, the pseudo-labels obtained by clustering are used to select samples with high confidence to train the downstream model.

Specifically, our contributions are summarized as follows:

  1. We propose a new self-supervised graph clustering model in which we divide the graph clustering task into two relatively independent processes: a pretext task that performs unsupervised clustering and a downstream task that performs semi-supervised learning. In the absence of real labels, we use the pseudo-labels obtained after sufficient training of the pretext task as supervised information to guide the training of the downstream model.

  2. We propose a new self-training method that does not require k-means cluster centers to further improve the accuracy of the upstream model. We also propose a reliable sample selection mechanism to reduce the negative impact of noisy samples at the cluster decision boundaries on model training and to improve the quality of the pseudo-labels as supervised information.

  3. We conduct extensive experiments on four major citation network datasets. The results show that the pretext model can guide the downstream model to achieve higher performance. At the same time, compared with recent unsupervised graph clustering algorithms, our DSAGC is competitive.

2 Related Work

2.1 Attribute Graph Clustering

Attribute graph clustering aims to divide the nodes of a graph into disjoint groups by using both the node features and the topology of the graph. However, some traditional clustering methods have a limited ability to extract graph information and can only exploit either node information or topology information, such as k-means [1], spectral clustering [2], and DeepWalk [3]. Thanks to the development of graph neural networks (GNNs), especially GCNs [15, 31, 32], there has been a significant breakthrough in attribute graph clustering techniques. Many current graph clustering algorithms combine graph embedding with graph clustering; a typical example is GAE [17] and its variants. MGAE [21] corrupts node features with noise and then trains the GAE to learn features after marginalizing the noise, improving the model's efficiency. ARGA [18] uses adversarial training to force the latent representations to match a prior distribution. The performance of a traditional GCN decreases significantly when the number of layers exceeds two, so AGC [23] proposes an adaptive graph convolution method that captures high-order neighborhood information through high-order graph convolution. GALA [20] proposes a symmetric graph auto-encoder whose decoder is learnable and symmetric to the encoder; it adopts Laplacian sharpening as the convolutional filter of the decoder, which speeds up model reconstruction. The above models use graph neural networks for representation learning, and the resulting embeddings are used for subsequent clustering tasks. Since the clustering process is independent of the model optimization, the learned embedding can hardly guarantee optimal clustering.

2.2 Self-Supervised Clustering

Self-supervised learning can use the information carried by the data itself to guide model training without manual labels. It has been successfully applied in various fields, such as natural language processing and computer vision. DeepCluster [28] uses clustering to generate pseudo-labels, which are then used to guide the training of a classifier in an end-to-end manner. Deep subspace clustering [8, 33] maps the feature matrix into the embedding space and then generates a self-expression matrix that allows each node to be linearly represented by other nodes in the same subspace. DEC [7] generates a high-confidence soft distribution from the Student's t-distribution and guides model training in a self-optimizing manner. DSSEC [34] builds on DEC by preserving local structure while using a stacked sparse auto-encoder, allowing the model to learn more representative features. SDCN [41] employs an AE to extract attribute features and a GNN to extract topological information, guiding the training of both modules simultaneously through a high-confidence soft distribution.

In recent years, many graph neural network models based on self-supervised methods have also emerged in the graph data domain. M3S [35] proposes a multi-stage training framework that compensates for the limited generalization ability of GCNs when few labels are available by continuously adding pseudo-labels to the labeled training set. SCRL [36] constructs a feature graph from node features and then uses the information shared between the feature graph and the topology graph as the supervisory signal to guide the model's training. CGCN [37] combines an attribute graph clustering network, composed of a variational graph auto-encoder and a Gaussian mixture model, with a semi-supervised classification network: if the pseudo-label of the unsupervised model is consistent with the semi-supervised prediction, the unlabeled node is added to the labeled set to improve the semi-supervised performance. GCA [38] and GraphCL [22] apply contrastive learning to graph datasets, forming two different views through data augmentation and enforcing the consistency of object information at different levels between the views through carefully designed contrastive losses. DAEGC [39] employs a graph attention network to form an auto-encoder and guides self-training on graph datasets by constructing a higher-confidence distribution. IDCEC [29] screens reliable samples according to the Euclidean distance between the latent representations and the cluster centers and then uses these samples and their pseudo-labels to guide the training of the downstream model. However, most of the above approaches focus on graph node classification, while the exploration of graph clustering is lacking. Therefore, we design a new pretext task to guide the graph clustering task in a self-supervised manner.

Fig. 1

The overall architecture of DSAGC. The model is mainly divided into three parts: the pretext task, the downstream task, and the reliable sample selection module. In the pretext task, the feature matrix X and the adjacency matrix A are used as inputs. The self-supervised symmetric graph auto-encoder is trained by combining the self-supervised loss and the reconstruction loss. After that, we use the obtained embedding matrix Z to construct a k-nearest-neighbor matrix \(G_{1}\) and use it to perform spectral clustering. When the pretext model is fully trained, the samples with high confidence are screened according to the pseudo-label information. In the downstream task, we apply an \(\alpha \)-order generalized Laplacian smoothing filter to the reliable samples and the other samples and then input them into the DNN model. The pseudo-labels corresponding to the reliable samples are used as supervised information for training

3 Method

In this section, we introduce our proposed DSAGC model in detail. Its structure is shown in Fig. 1. Specifically, we describe the key modules of the model, propose an auxiliary distribution based on spectral clustering and a reliable sample selection method, and discuss the training strategy in which the upstream model guides the training of the downstream model.

3.1 Notations

A graph is represented as \(G=(V,E,X)\), where \(V=\{v_{1},v_{2},\ldots ,v_{n}\}\) denotes the node set with \(\vert V \vert =n\) and \(E=\{e_{ij}\}\) denotes the edge set. The adjacency matrix is \(A\in {\mathbb {R}}^{n\times n}\), where \(A_{ij}=1\) if \((v_{i},v_{j})\in E\) and \(A_{ij}=0\) otherwise. The feature matrix is \(X=\{x_{1}^{T},x_{2}^{T},\ldots ,x_{n}^{T}\}\), where \(x_{i}\in {\mathbb {R}}^{d}\) is the attribute vector associated with node \(v_{i}\). \({\widehat{y}}_{cluster}\) denotes the pseudo-labels generated by the upstream task, abbreviated as \({\widehat{y}}^{c}\). \({\widehat{y}}_{self\_supervised}\) denotes the pseudo-labels generated by the downstream task, abbreviated as \({\widehat{y}}^{s}\).
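As a minimal illustration (not the authors' code), the objects above can be instantiated as follows; the graph size and the contents of `edge_list` are placeholders.

```python
import numpy as np
import scipy.sparse as sp

n, d = 2708, 1433                                  # e.g., a Cora-sized graph
edge_list = np.array([[0, 1], [1, 2]])             # placeholder edges (v_i, v_j)

# Adjacency matrix A in R^{n x n}: A_ij = 1 if (v_i, v_j) in E, else 0
A = sp.coo_matrix((np.ones(len(edge_list)),
                   (edge_list[:, 0], edge_list[:, 1])), shape=(n, n)).tocsr()
A = A.maximum(A.T)                                 # symmetrize (undirected graph)

# Feature matrix X = [x_1^T; ...; x_n^T], each x_i in R^d
X = np.zeros((n, d), dtype=np.float32)
```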

3.2 Deep Self-Supervised Attribute Graph Cluster (DSAGC)

Our proposed model consists of three main parts: a pretext self-supervised symmetric graph auto-encoder module, a downstream semi-supervised classification module, and a reliable sample selection module. In addition to the basic reconstruction loss, the pretext module uses the information provided by spectral clustering to generate a high-confidence distribution, which then guides the training of the target distribution and enables the simultaneous optimization of representation learning and clustering. Once the upstream model is sufficiently trained, the reliable sample selection module utilizes the obtained pseudo-labels and sets thresholds to select high-confidence samples for each cluster, reducing the influence of noisy samples at the cluster boundaries. The downstream model is a semi-supervised model. Unlike other graph self-supervised algorithms that expand the labeled set with high-confidence pseudo-labels, it is trained only on the reliable samples passed in from the pretext task together with their corresponding pseudo-labels, without any real labels involved in the training.

3.2.1 Pretext Task: Self-Supervised Symmetric Graph Auto-Encoder (SSGAE)

In the pretext task, we adopt a symmetric graph auto-encoder similar to that in GALA [20]. In the encoder part, we use a two-layer GCN in which each layer is defined as \(H^{\left( l+1\right) }=\sigma \left( {\widetilde{D}}^{-\frac{1}{2}}{\widetilde{A}}{\widetilde{D}}^{-\frac{1}{2}}H^{\left( l\right) }W^{\left( l\right) }\right) \), where \({\widetilde{A}}=A+I\) and \({\widetilde{D}}_{ii}=\sum _{j=1}^{n}{\widetilde{A}}_{ij}\). In the decoder part, we use a two-layer Laplacian sharpening filter in which each layer is defined as \(H^{\left( l+1\right) }=\sigma \left( {\widehat{D}}^{-\frac{1}{2}}{\widehat{A}}{\widehat{D}}^{-\frac{1}{2}}H^{\left( l\right) }W^{\left( l\right) }\right) \), where \( {\widehat{A}}=2I-A \) and \( {\widehat{D}}=2I+D \). Following SGC [40], we keep the activation function only in the first layer of the encoder and the last layer of the decoder to reduce the risk of overfitting. The encoder makes the features of neighboring samples gradually similar, and the decoder restores the differences between samples so that the resulting embedding carries more sample information. The basic loss function of the model is the reconstruction of the feature matrix X, expressed as follows.

$$\begin{aligned} L_{re}=\text {min}\left\| {\widehat{X}}-X\right\| _{F}^{2} \end{aligned}$$
(1)

Where \({\widehat{X}}\) is the reconstructed feature matrix and \( \left\| \cdot \right\| _{F} \) denotes the Frobenius norm. However, relying solely on the reconstruction of graph data makes it difficult to ensure that the learned embeddings are appropriate for specific downstream tasks. Several works [7, 27, 39, 41] have used the Student's t-distribution to construct higher-confidence auxiliary distributions, using the additional information generated by clustering as guidance and obtaining better model performance.
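For concreteness, a minimal PyTorch sketch of this symmetric encoder-decoder is given below. It follows the two filters and Eq. (1) described above, but the layer widths (`hid_dim`, `emb_dim`), dense-matrix operations, and initialization are our own assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sym_norm(M):
    """D^{-1/2} M D^{-1/2}, where D is the row-sum degree matrix of M."""
    d_inv_sqrt = M.sum(dim=1).pow(-0.5)
    d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0
    return d_inv_sqrt.unsqueeze(1) * M * d_inv_sqrt.unsqueeze(0)

class SSGAE(nn.Module):
    def __init__(self, A, in_dim, hid_dim=256, emb_dim=128):
        super().__init__()
        n = A.size(0)
        I = torch.eye(n)
        D = torch.diag(A.sum(dim=1))
        # Encoder filter: Laplacian smoothing with A_tilde = A + I
        self.enc_filter = sym_norm(A + I)
        # Decoder filter: Laplacian sharpening with A_hat = 2I - A, D_hat = 2I + D
        A_hat, D_hat = 2 * I - A, 2 * I + D
        d_inv_sqrt = torch.diag(D_hat.diag().pow(-0.5))
        self.dec_filter = d_inv_sqrt @ A_hat @ d_inv_sqrt
        self.W = nn.ParameterList([
            nn.Parameter(torch.empty(in_dim, hid_dim)),   # encoder layer 1
            nn.Parameter(torch.empty(hid_dim, emb_dim)),  # encoder layer 2
            nn.Parameter(torch.empty(emb_dim, hid_dim)),  # decoder layer 1
            nn.Parameter(torch.empty(hid_dim, in_dim)),   # decoder layer 2
        ])
        for w in self.W:
            nn.init.xavier_uniform_(w)

    def forward(self, X):
        # ReLU only in the first encoder layer and the last decoder layer (cf. SGC)
        H = F.relu(self.enc_filter @ X @ self.W[0])
        Z = self.enc_filter @ H @ self.W[1]           # embedding
        H = self.dec_filter @ Z @ self.W[2]
        X_hat = F.relu(self.dec_filter @ H @ self.W[3])
        return Z, X_hat

# Reconstruction loss of Eq. (1): L_re = ||X_hat - X||_F^2
# loss_re = torch.norm(X_hat - X, p="fro") ** 2
```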

The Student's t-distribution is based on the Euclidean distance, and one of its basic assumptions is that the closer a sample is to a cluster center, the higher the probability that it belongs to that cluster. However, some combinations of graph auto-encoders and k-means are not ideal, which limits the effectiveness of this self-training method. One of our basic assumptions is that the larger the inner-product distance between two samples, the more similar they are and, therefore, the higher the probability that they belong to the same cluster. We use the obtained embedding Z and the cosine similarity to construct the similarity matrix \(S\in {\mathbb {R}}^{n\times n}\):

$$\begin{aligned} {S_{ij} = \frac{Z_{i,:}Z_{j,:}^T}{\Vert Z_{i,:}\Vert _2\Vert Z_{j,:}\Vert _2}} \end{aligned}$$
(2)

Here, \(\left\| \cdot \right\| _{2}\) denotes the L2 norm. After that, we use the label information obtained by spectral clustering to partition the similarity matrix S. We multiply the values of node pairs in the same cluster by a factor \((1+t)\) and the values of node pairs in different clusters by a factor \((1-t)\) to obtain the auxiliary distribution \(S^{'}\), where \(t\in (0,1)\) is a hyper-parameter; a larger value further widens the gap between the similarities of same-cluster and different-cluster samples:

$$\begin{aligned} {S_{ij}^{'}}= {\left\{ \begin{array}{ll} S_{ij}\times \left( 1+t\right) ,&{}{j\in C_{i}}\\ {S_{ij}\times \left( 1-t\right) ,}&{}{j\notin C_{i}} \end{array}\right. } \end{aligned}$$
(3)

The auxiliary distribution \(S^{'}\) is normalized to get:

$$\begin{aligned} S_{ij}^{'}=\frac{S_{ij}^{'}}{\sum _{k}S_{ik}^{'}} \end{aligned}$$
(4)

The model then performs self-training via KL divergence:

$$\begin{aligned} L_{se}=KL\left( S^{'}\Vert S\right) =\sum _{i}\sum _{j}S_{ij}^{'}\log \frac{S_{ij}^{'}}{S_{ij}} \end{aligned}$$
(5)

We jointly optimize the embedding of the symmetric graph auto-encoder and clustering learning by defining the overall loss function of the pretext model as:

$$\begin{aligned} L=L_{re}+\gamma L_{se} \end{aligned}$$
(6)

Where \(L_{re}\) and \(L_{se}\) denote the reconstruction loss and the self-supervised loss, respectively, and \(\gamma \) is the trade-off between them. Meanwhile, to avoid instability of the self-supervised optimization during training, we update the auxiliary distribution \(S^{'}\) every five iterations in the experiments.
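The auxiliary distribution and self-training loss of Eqs. (2)-(6) can be sketched as follows. This is a hedged illustration, not the authors' code: `pseudo` denotes the spectral-clustering labels, and the small clamping constant `eps` is an implementation assumption to keep the logarithm finite.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import kneighbors_graph

def self_supervised_loss(Z, pseudo, t=0.5, eps=1e-12):
    # Eq. (2): cosine-similarity matrix S from the embedding Z
    Zn = F.normalize(Z, p=2, dim=1)
    S = Zn @ Zn.t()
    # Eq. (3): scale same-cluster pairs by (1 + t), different-cluster pairs by (1 - t)
    same = (pseudo.unsqueeze(0) == pseudo.unsqueeze(1)).float()
    S_aux = S * (1 + t) * same + S * (1 - t) * (1 - same)
    # Eq. (4): row-normalize the auxiliary distribution (detached: it acts as a target)
    S_aux = (S_aux / S_aux.sum(dim=1, keepdim=True)).detach()
    # Eq. (5): KL(S' || S), with clamping as a numerical-safety assumption
    S_aux = S_aux.clamp(min=eps)
    return (S_aux * (S_aux.log() - S.clamp(min=eps).log())).sum()

def pseudo_labels_from_embedding(Z, k_clusters, knn=20):
    # k-NN graph G_1 built from the embedding, then spectral clustering (cf. Fig. 1)
    G1 = kneighbors_graph(Z.detach().cpu().numpy(), n_neighbors=knn, include_self=False)
    G1 = 0.5 * (G1 + G1.T)   # symmetrize
    sc = SpectralClustering(n_clusters=k_clusters, affinity="precomputed")
    return torch.as_tensor(sc.fit_predict(G1))

# Overall pretext loss of Eq. (6): L = L_re + gamma * L_se, with the auxiliary
# distribution S' refreshed every T (here five) iterations as described above.
```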

3.2.2 Reliable Sample Selection

Most existing self-supervised works use the pseudo-labels generated by clustering only as a complement to the real labels in semi-supervised tasks. One of the main reasons is the presence of noise in the pseudo-labels, which accumulates errors during training and degrades the overall performance of the model. IDCEC [29] clusters the embeddings of image data with k-means and uses the Euclidean distance between an embedded node and its cluster center as a measure of sample confidence; samples whose distances are less than a fixed threshold are chosen as reliable samples. This method effectively reduces the effect of misclassified sample embeddings at the clustering boundary. However, it depends on the quality of the cluster centers and is also affected by the number of samples in each cluster. Inspired by these ideas, we explore a new mechanism for reliable sample selection based on the spectral clustering algorithm.

One of our basic ideas is that a selected sample should have high similarity to the rest of the samples in its own cluster while having low similarity to the samples in other clusters.

First, we use the pseudo-labels to divide the samples into k disjoint clusters \(C_{j},j=1,\ldots ,k\). Then, using the similarity matrix S defined in Eq. 2, we take the samples that belong to the same cluster \(C_{j}\) and sum the similarities of each target node u with the remaining nodes v, obtaining a vector of length \(\vert C_{j}\vert \):

$$\begin{aligned} D_{C_{j}}=\{d_{1},d_{2},\ldots d_{\vert C_{j}\vert }\},d_{u}=\sum _{v \in C_{j}}s_{uv} \end{aligned}$$
(7)

This value measures the importance of sample u within cluster \(C_{j}\) in terms of similarity. After that, we select the samples corresponding to the top \(k_{1}\%\) largest values as the reliable samples \(sample_{1}\) of cluster \(C_{j}\).

In addition, we impose an interval constraint \(\zeta \) between the intra-cluster and inter-cluster similarities of a sample in order to filter out the noisy samples near the decision boundary. For each target sample u, we subtract its maximum similarity \(s_{ub}\) to samples in other clusters from its maximum similarity \(s_{ua}\) to samples in the same cluster.

$$\begin{aligned} \zeta _{u}=s_{ua}-s_{ub}, {\left\{ \begin{array}{ll} s_{ua}=\max \, s_{ui},&{}{i\in C_{j}}\\ {s_{ub}=\max \, s_{uv}},&{}{v\notin C_{j}} \end{array}\right. } \end{aligned}$$
(8)

Considering that the gap between intra-cluster and inter-cluster similarities varies across clusters, we sort the \(\zeta _{u}\) values of each cluster from largest to smallest and take the value at the \(k_{2}\%\) position as the interval \(m_{j}\) of that cluster. For convenience, we then take the minimum of the \(m_{j}\) as the uniform interval m. If \(\zeta _{u}\) is greater than or equal to the interval m, sample u is added to the reliable set \(sample_{2}\). Ultimately, only samples that satisfy both of the previous conditions are selected as reliable samples sample, i.e.:

$$\begin{aligned} sample=sample_{1}\cap sample_{2} \end{aligned}$$
(9)

After selecting reliable samples, we use them and their corresponding pseudo-labels as a training set to guide the downstream classification task in a semi-supervised way.
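A possible implementation of the selection rules in Eqs. (7)-(9) is sketched below, assuming `S` is the cosine-similarity matrix of Eq. (2) and `pseudo` holds the pretext pseudo-labels; the exclusion of self-similarity in Eq. (8) and the handling of very small clusters are our assumptions.

```python
import torch

def select_reliable_samples(S, pseudo, k1=0.3, k2=0.4):
    n = S.size(0)
    sample1, zeta, thresholds = set(), torch.empty(n), []

    for c in pseudo.unique():
        idx = (pseudo == c).nonzero(as_tuple=True)[0]
        # Eq. (7): d_u = sum of similarities to the members of the same cluster
        d = S[idx][:, idx].sum(dim=1)
        top = idx[d.argsort(descending=True)[: max(1, int(k1 * len(idx)))]]
        sample1.update(top.tolist())

        # Eq. (8): margin between best intra- and best inter-cluster similarity
        intra = S[idx][:, idx].clone()
        intra.fill_diagonal_(float("-inf"))          # exclude self-similarity (assumption)
        s_ua = intra.max(dim=1).values
        inter_idx = (pseudo != c).nonzero(as_tuple=True)[0]
        s_ub = S[idx][:, inter_idx].max(dim=1).values
        zeta[idx] = s_ua - s_ub
        # per-cluster interval m_j: the k2%-th largest zeta within the cluster
        sorted_z = zeta[idx].sort(descending=True).values
        thresholds.append(sorted_z[min(len(idx) - 1, int(k2 * len(idx)))])

    m = torch.stack(thresholds).min()                # uniform interval m
    sample2 = set((zeta >= m).nonzero(as_tuple=True)[0].tolist())
    return sorted(sample1 & sample2)                 # Eq. (9)
```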

3.2.3 Downstream Task

In most recent works, pseudo-labels are used only as additional supervised information to improve the performance of semi-supervised classification tasks, which still requires real labels. In this paper, we use only the pseudo-labels and the corresponding reliable samples as the supervised information of the model. Meanwhile, we introduce the Laplacian smoothing filter from the Adaptive Graph Encoder (AGE) [24] to better filter the high-frequency noise in the reliable samples and improve the performance of the downstream model. We compare the performance of the two different graph convolution schemes in the ablation study section.

First, the feature matrix is multiplied by the \(\alpha \)-order graph convolution filter:

$$\begin{aligned} {\widetilde{X}}=H^{\alpha }X \end{aligned}$$
(10)

Here, H is the Laplacian smoothing filter from AGE, i.e., \(H=I-\beta {\widetilde{L}}\), where \({\widetilde{L}}\) denotes the symmetric normalized graph Laplacian matrix and \(\beta \) is the inverse of the corresponding spectral radius. The graph filter acts as a low-pass filter over the entire range of eigenvalues of the Laplacian matrix.

After that, we input the filtered feature matrix \({\widetilde{X}}\) into a two-layer DNN model.

$$\begin{aligned} Y=\sigma \left( \sigma \left( {\widetilde{X}}W_{1}\right) W_{2}\right) \end{aligned}$$
(11)

Where Y denotes the output of the downstream model, \(\sigma \) denotes the ReLU activation function, and \(W_{i}\) denotes the weight matrix of the i-th layer. For each node \(v_{i}\), the corresponding predicted label is:

$$\begin{aligned} {\widehat{y}}^{s}_{i}=\underset{u}{\text {arg max}}\,Y_{iu} \end{aligned}$$
(12)

The loss function of the downstream model is

$$\begin{aligned} L=-\sum _{i\in sample}{\widehat{y}}_{i}^{c}\text {ln}{\widehat{y}}_{i}^{s} \end{aligned}$$
(13)

Where only the samples from the reliable sample set sample are used as the training set. The pseudo-labels \({\widehat{y}}^{c}\) generated for these samples by the upstream model serve as the supervisory information, and the predictions \({\widehat{y}}^{s}\) generated by the downstream model are the training targets.
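The downstream pipeline of Eqs. (10)-(13) can be sketched as follows; the value of `alpha` and the hidden width are illustrative, isolated-node degrees are guarded for numerical safety, and, as is common practice, the cross-entropy is computed through a softmax over the DNN outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def smooth_features(X, A, alpha=8):
    """Eq. (10): X_tilde = H^alpha X with H = I - beta * L_sym (cf. AGE)."""
    n = A.size(0)
    I = torch.eye(n)
    deg = A.sum(dim=1).clamp(min=1)                    # guard isolated nodes (assumption)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    L_sym = I - d_inv_sqrt @ A @ d_inv_sqrt            # symmetric normalized Laplacian
    beta = 1.0 / torch.linalg.eigvalsh(L_sym).max()    # inverse of the spectral radius
    H = I - beta * L_sym                               # low-pass filter
    X_tilde = X
    for _ in range(alpha):
        X_tilde = H @ X_tilde
    return X_tilde

class DownstreamDNN(nn.Module):
    """Eq. (11): Y = ReLU(ReLU(X_tilde W1) W2)."""
    def __init__(self, in_dim, hid_dim, n_clusters):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hid_dim)
        self.fc2 = nn.Linear(hid_dim, n_clusters)

    def forward(self, X_tilde):
        return F.relu(self.fc2(F.relu(self.fc1(X_tilde))))

# Training uses only the reliable samples and their pseudo-labels:
#   model = DownstreamDNN(X_tilde.size(1), 256, n_clusters)
#   logits = model(X_tilde)
#   loss = F.cross_entropy(logits[reliable_idx], pseudo[reliable_idx])  # Eq. (13)
#   y_hat_s = logits.argmax(dim=1)                                      # Eq. (12)
```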

The proposed pretext task and the overall self-supervised attributed graph clustering procedure are described in Algorithm 1 and Algorithm 2, respectively.

Algorithm 1

Self-supervised Symmetric Graph Auto-Encoder (SSGAE)

Algorithm 2

Deep Self-supervised Attributed Graph Clustering (DSAGC)

4 Experiments

4.1 Datasets

To validate the effectiveness of our proposed DSAGC method on the node clustering task, we conduct extensive experiments on four benchmark datasets. These four datasets are all widely used citation network datasets, including Cora, Citeseer, Wiki, and Pubmed. The details of each dataset are summarized in Table 1.

Table 1 Summary of the datasets

4.2 Experiment Settings

In the pretext task, we adopt a symmetric graph auto-encoder as the backbone and use a high-confidence distribution based on spectral clustering as the constraint term, so that the model learns a representation that is more favorable to clustering. Following GALA, we construct a k-nearest-neighbor similarity matrix from the embeddings in the embedding space and use it for the spectral clustering task; the numbers of nearest neighbors are set to 20, 100, 20, and 800, and the learning rates to 0.005, 0.005, 0.001, and 0.001 for the four datasets, respectively. We adopt the Adam optimizer and use ReLU as the activation function in the input and output layers of the symmetric graph auto-encoder. In the training of the pretext task, we first obtain the model parameters corresponding to the best accuracy of the pre-trained model and then add the self-supervised loss term during training. For convenience, we uniformly set the coefficient of the self-supervised term \(\gamma \) in the overall loss function, the factor t in the auxiliary distribution \(S^{'}\), and the update interval T to 5, 0.5, and 5, respectively, on all four datasets.

Similarly, we save the model parameters corresponding to the clustering accuracy of the upstream model at convergence and later use the pseudo-labels and embeddings obtained under this setting to select reliable samples. The threshold \(k_{1}\) is set to 0.3 and the threshold \(k_{2}\) to 0.4 for reliable sample selection. In the downstream task, we input all samples into a two-layer DNN model, where the reliable samples selected by the upstream model serve as the training set and their corresponding pseudo-labels as the supervisory information. Before the samples are input into the model, they are processed by the \(\alpha \)-order generalized Laplacian smoothing filter; the specific parameter settings are consistent with those in [24]. The Adam optimizer is used, the learning rate is set to 0.005 for all datasets, and the activation function is ReLU. All code is implemented with PyTorch 1.7.0 on Windows 10.
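For reference, the hyper-parameters stated above can be collected into a configuration sketch; the dataset order is assumed to follow Sect. 4.1 (Cora, Citeseer, Wiki, Pubmed), and the key names are illustrative.

```python
# Hyper-parameter summary as stated in this section (keys are our own naming).
config = {
    "knn_neighbors": {"cora": 20, "citeseer": 100, "wiki": 20, "pubmed": 800},
    "pretext_lr":    {"cora": 0.005, "citeseer": 0.005, "wiki": 0.001, "pubmed": 0.001},
    "gamma": 5,              # weight of the self-supervised term in Eq. (6)
    "t": 0.5,                # scaling factor in the auxiliary distribution S'
    "update_interval": 5,    # iterations between refreshes of S'
    "k1": 0.3, "k2": 0.4,    # reliable-sample selection thresholds
    "downstream_lr": 0.005,
    "optimizer": "Adam",
    "activation": "ReLU",
}
```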

4.3 Baselines and Evaluation Metrics

In the node clustering task, we use three metrics to measure the model's performance, i.e., ACC, NMI, and ARI. We compare the proposed DSAGC with two categories of methods: clustering algorithms that use only node features or graph structure information and deep-learning-based graph clustering algorithms. Algorithms that use only features or structures for clustering include: k-means [1], spectral clustering (SC) [2], Graph-Encoder [42], DeepWalk [3], DNGR [43], DEC [7], and TADW [44]. The deep graph clustering algorithms include: Graph Autoencoder (GAE), Variational Graph Autoencoder (VGAE) [17], Adversarial Regularized Graph Autoencoder (ARGE), Adversarial Regularized Variational Graph Autoencoder (ARVGE) [18], Deep Attentional Embedded Graph Clustering (DAEGC) [39], Embedding Graph Autoencoder (EGAE) [25], Adaptive Graph Convolution (AGC) [23], Graph convolutional Autoencoder using LAplacian smoothing and sharpening (GALA) [20], Graph Clustering via Variational Graph Embedding (GC-VGE) [45], and Structural Deep Clustering Network (SDCN) [41].

The performance of each method on the four datasets is shown in Tables 2, 3, 4 and 5. The data in bold indicate the best performance under the corresponding metric. The results are mainly taken from the papers of the corresponding methods; for methods that do not test these datasets in their original papers, we cite results reported elsewhere. The upstream model adopts a symmetric graph auto-encoder similar to GALA, and its performance initially lags behind the latter by 1% to 2%, but with the addition of the self-supervised term this gap is reversed on some datasets. We select reliable samples based on the best performance of the upstream model so that they serve as the only supervised information to guide the downstream model. In this setting, we run the downstream model 10 times and take the average value as the final result. Compared with our upstream model, the downstream model improves essentially all three performance metrics, which shows the feasibility of relying solely on reliable samples to guide the semi-supervised training of the downstream model. Our method also achieves very competitive results compared with other methods. In particular, the ACC on the Wiki dataset increases by nearly 6% and its ARI by 7%, while the other three datasets improve by 1.5% to 2%. At the same time, not all performance metrics achieve optimal results on the Pubmed dataset. One of the main reasons is that spectral clustering has difficulty guaranteeing both speed and accuracy when dealing with graph datasets with a very large number of nodes, which is a direction worth studying next.

Table 2 Experimental results of node clustering on Cora Dataset
Table 3 Experimental results of node clustering on Citeseer Dataset
Table 4 Experimental results of node clustering on Wiki Dataset
Table 5 Experimental results of node clustering on Pubmed Dataset
Table 6 The effect of using activation functions in different layers on symmetric graph auto-encoder

4.4 Ablation Study

In this section, we test three main aspects: (1) the effect of using activation functions in different layers on symmetric graph auto-encoder. (2) the effect of different self-optimizing methods on the performance of symmetric graph auto-encoder. (3) the effect of the selection of the downstream model on the final accuracy.

We conduct experiments on the placement of activation functions in the upstream model and find that using activation functions only in the first and last layers of the symmetric graph auto-encoder achieves better results. The experimental results are shown in Table 6. We hypothesize that using activation functions in the lower-dimensional hidden and embedding layers of the symmetric graph auto-encoder causes a loss of information, which affects the reconstruction of the feature matrix X.

We compare the impact of the currently popular Student's t-distribution and our proposed method as the self-optimizing term on the performance of the upstream symmetric graph auto-encoder. The former combines the symmetric graph auto-encoder, k-means, and a self-optimizing method based on the Student's t-distribution (referred to as combination I). The latter combines the symmetric graph auto-encoder, spectral clustering on the KNN graph, and the proposed method (referred to as combination II). The results are shown in Fig. 2, where the cherry-red lines show combination II, the orange lines show combination I, the dashed lines show the results without the self-optimizing term, and the solid lines show the results with it. We compare the performance of the two combinations with self-optimizing term coefficients from 1 to 10. Since the gradient of combination I on the Pubmed dataset is smaller, we use coefficients with a larger span for it. The figure shows that combination II already outperforms combination I even without the self-optimizing term. Except on the Citeseer dataset, the performance of combination I is sensitive to the coefficient of the self-optimizing term, which does not always ensure an improvement; on the Pubmed dataset in particular, the performance even decreases. We conjecture that this dataset contains many samples, and it is difficult to obtain effective soft-label information simply from the Euclidean distance between the embedded points and the cluster centers when the clustering information is not particularly reliable. The effectiveness of our method is demonstrated by the consistent performance improvement of the upstream model on all four datasets.

In terms of downstream model selection, we compare two different graph convolution schemes: one is a traditional two-layer GCN structure, and the other decouples the filter from the weight matrix (i.e., combines the \(\alpha \)-th power of the filter with a DNN). The latter uses a generalized Laplacian smoothing filter, which takes the spectral radius of each dataset into account and achieves better low-pass filtering. We keep the model structure and input samples the same. As shown in Fig. 3, the performance of the latter is slightly stronger than that of the former, which shows to some extent that the entanglement of the filter and the weight matrix does affect the performance of the model.

Fig. 2

The effect of different self-optimization methods on the performance of symmetric graph auto-encoder. a Cora. b Citeseer. c Wiki. d, e Pubmed

Fig. 3

The overall model accuracy of different downstream models. a ACC. b NMI. c ARI

4.5 Parameters Analysis

In this section, we will mainly analyze the influence of some hyper-parameters in the model.

Fig. 4

The effect of sample selection coefficients \(k_{1}\) and \(k_{2}\) in the downstream model. a Only \(k_{1}\). b Only \(k_{2}\). c \(k_{2}=0.3\). d \(k_{1}=0.3\)

4.5.1 Sample Selection Coefficients \(k_{1}\) and \(k_{2}\).

In this section, to verify the effectiveness of the reliable sample selection mechanism, we test the effect of the two reliable sample thresholds \(k_{1}\) and \(k_{2}\) on the downstream accuracy of the model, using the Cora dataset as an example. Figure 4a shows the effect of the value of \(k_{1}\) on the downstream model when only \(k_{1}\) is used as the reliable sample selection mechanism, and Fig. 4b shows the corresponding effect of \(k_{2}\). The three dashed lines in the figure indicate the performance of the upstream model under the three metrics as a reference. Clearly, both reliable sample selection mechanisms can greatly improve the quality of the pseudo-labels as supervisory information. When the value of \(k_{1}\) is small, we obtain samples with higher confidence, but their number is small, so the overall generalization performance is poor. Since we select samples by percentage, the number of samples selected for each cluster is relatively balanced, and the overall model performance does not fluctuate significantly with changes in the \(k_{1}\) and \(k_{2}\) values. As the thresholds increase, the number of samples and their reliability reach a balance point, achieving better results. We also test the combination of the two reliable sample criteria. In particular, Fig. 4c shows the effect of varying \(k_{1}\) on the downstream model when \(k_{2}=0.3\), and Fig. 4d shows the effect of varying \(k_{2}\) when \(k_{1}=0.3\). We can see that when \(k_{1}\) or \(k_{2}\) is less than or equal to 0.3, the performance of the model is generally poor and even lower than the accuracy of the upstream model. The main reason is that not enough sample points satisfy both conditions at the same time, so the performance of the model is seriously affected. As the reliable sample thresholds increase, the effects of sample size and sample reliability on the model are balanced, and the downstream model outperforms the upstream model, achieving the goal of using pseudo-labels to guide the training of the downstream model.

Fig. 5

2D t-SNE visualization of DSAGC on Cora and Pubmed. Each column represents the original data, reconstruction only, the pretext task, and the downstream task, respectively, i.e., the data distribution at the four phases of model training. a-d Cora. e-h Pubmed

4.5.2 Visualization Analysis

To more intuitively reflect the impact of the two self-supervised components on the data distribution at each stage of model training, we perform t-SNE visualization on two datasets. As shown in Fig. 5, the model can perform an initial clustering of the samples when guided only by the reconstruction loss, but the different clusters remain very close to each other. In contrast, when the clustering information is used to construct the trustworthy distribution that guides the training, the distance between the clusters is progressively expanded, and samples in the same cluster move closer to each other. Through the reliable sample selection mechanism, we further weaken the negative impact of cluster boundary samples and improve the performance of the downstream model.

5 Conclusion

In this paper, we propose a new deep self-supervised attributed graph clustering framework for the analysis of social network datasets. In the pretext task, the model uses the pseudo-label information generated by clustering to construct a high-confidence distribution based on spectral clustering, which guides the model to learn embeddings suited to the specific clustering task. We further use the pseudo-label information to select reliable samples to assist the training of the downstream model. Under the guidance of the pseudo-labels, we use the cosine similarity between embeddings to select, as reliable samples, the samples most similar to the others in the same cluster. We also add a margin between the similarities of same-cluster and different-cluster samples to improve their quality. We evaluate the proposed DSAGC on four popular benchmark datasets. The experimental results show the effectiveness of the proposed model on the node clustering task.