Social behavior prediction with graph U-Net+

We focus on the problem of predicting the future behavior of social media users and treat it as a binary node classification task on graphs. Existing works use graph representation learning to assign each node an embedding vector and then update the node representations through different message passing and aggregation mechanisms, such as GCN or GAT. In this paper, we build on the fact that social media users influence their local neighborhoods, and we extract subgraph structures from real-world social networks. We propose an encoder-decoder architecture based on graph U-Net, which we call graph U-Net+. To improve feature extraction in the convolutional process and to mitigate the over-smoothing problem, we introduce a bilinear information aggregator and NodeNorm normalization into both the encoding and decoding blocks. We reuse four datasets from DeepInf, and extensive experimental results demonstrate that our method achieves better performance than previous models.


Introduction
With the explosive growth of social media such as Twitter, WeChat, and Weibo, many people have joined the Internet and formed huge social network neighborhoods. As shown in Fig. 1, these platforms allow users to directly view the content of all their online friends and help users extend the breadth and depth of information propagation through online interactions such as subscribing, sharing, and retweeting. During user interaction, user-level social influence naturally appears and imperceptibly "interferes" with the behavior and judgment of each user [1][2][3]. The term social influence refers to the way a user's opinions, decisions, or behaviors may be affected by their social neighbors or by influential people. For example, we may choose whether to watch a movie based on the suggestion of a close friend. As another example, a person may decide whether to believe a piece of political news based on what an authoritative official says. Therefore, collecting and analyzing users' friendship networks and history data makes it possible for governments or companies to predict users' future behavior, and thus to determine whether a user will engage in certain activities such as voting or purchasing. The resulting social user behavior prediction task has important value in many application fields, such as online advertising [4,5], recommendation systems [6][7][8], viral influence prediction in marketing [9][10][11], rumor spreading [12], and even presidential elections [13].
Indeed, extensive work has been done on social behavior prediction in the literature [2,14,15,16]. For example, Li et al. [16] proposed an end-to-end predictor, DeepCas, which studies the diffusion cascade of social influence by […] more efficiency. Compared with graph U-Net, the experimental results show that graph U-Net+ performs better owing to its effective architecture and powerful information aggregation mechanism. Moreover, we conduct ablation experiments to test the effectiveness of the different modules. Finally, we give pseudocode for the complete procedure so that readers can better understand how we train, validate, and test the model.
The rest of this paper is organized as follows: Sect. 2 introduces some related works. Section 3 gives the details of our proposed architecture graph U-Net+. In Sects. 4 and 5, we conduct extensive experiments on four illustrated datasets and analyze the results. Finally, we draw a conclusion in Sect. 6.

Related work
In this section, we will introduce some related work from two aspects: graph representation learning (information aggregation layer) and graph U-Net architecture.
Recently, there have been many works on graph representation learning. As a general framework, MPNN [34] abstracts the commonalities of existing models and treats them as a unified process of information propagation and aggregation. Some studies applied this framework to graph node classification tasks. Inspired by ChebNet [35] and first-order graph Laplacian methods, Kipf and Welling proposed the graph convolutional network (GCN) [17], which achieved promising performance. The information aggregation layer in GCN is defined as:

$$X_{\ell+1} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X_\ell W_\ell\right)$$

where $\hat{A} = A + I$ adds self-loops to the input adjacency matrix $A$, $X_\ell$ is the node feature matrix in which each row is the feature vector of one node, $W_\ell$ is a trainable weight matrix that applies a linear transformation to the feature vectors of layer $\ell$, and $\sigma$ is a nonlinear activation function. GCN uses the diagonal node degree matrix $\hat{D}$ to normalize $\hat{A}$, and essentially performs information aggregation and transformation on node features as MPNN does. In the spatial domain, GraphSage [18] samples a fixed number of neighboring nodes to simulate the information aggregation process of the first-order approximate GCN after stacking multiple layers. GAT [23] uses self-attention mechanisms to assign different weights to neighboring nodes.
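As an illustration of the GCN aggregation step described above, the following is a minimal NumPy sketch. The function name `gcn_layer`, the choice of ReLU as the nonlinearity, and the toy path graph are assumptions made for this example only and are not taken from the original implementation.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1)                    # degrees of A_hat (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: scale row i and column j by deg^-1/2.
    P = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(P @ X @ W, 0.0)          # ReLU is an assumption of this sketch

# Toy path graph 0 - 1 - 2 with 2 features per node.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
W = np.eye(2)                                  # identity weights keep the example readable
H = gcn_layer(A, X, W)
```

Each output row mixes a node's own features with those of its direct neighbors, weighted by the normalized adjacency entries.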
In addition to graph representation learning, some studies have tried to extend pooling operations from CNNs to the graph domain. ChebNet [35] fixes the indices of nodes before applying 1-D pooling operations by using binary tree indexing for graph coarsening. Simonovsky and Komodakis [36] used a deterministic graph clustering algorithm to determine pooling patterns. Ying et al. [37] proposed DiffPool, which uses an assignment matrix to achieve pooling by assigning nodes to different clusters of the next layer. In graph U-Net [30], Gao and Ji proposed a novel graph pooling (gPool) operation. They set a global projection vector, and the projection of each node feature vector onto it is regarded as the importance degree of that node. The importance degree is then used to scale the features of the selected nodes, which further enhances the gradient learning of nodes with higher importance. The gPool layer discards nodes layer by layer to improve the efficiency of information aggregation between long-distance nodes. This is an important implementation of pooling on graph-structured data based on the TopK scheme. Correspondingly, an unpooling operation named gUnpool is proposed in graph U-Net, which restores the graph to its original structure with the help of the locations of the nodes selected in the corresponding gPool layer.
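The gPool and gUnpool operations described above can be sketched as follows. This is a simplified NumPy illustration, not the reference implementation: the names `gpool` and `gunpool`, the sigmoid gate, and the decision to keep the selected indices in their original order are assumptions of this sketch.

```python
import numpy as np

def gpool(A, X, p, ratio=0.5):
    """Top-k graph pooling: keep the nodes with the largest projection on p."""
    k = max(1, int(np.ceil(ratio * X.shape[0])))
    scores = X @ p / np.linalg.norm(p)           # scalar importance per node
    idx = np.sort(np.argsort(-scores)[:k])       # top-k indices, kept in original order
    gate = 1.0 / (1.0 + np.exp(-scores[idx]))    # sigmoid gate scales the kept features
    return A[np.ix_(idx, idx)], X[idx] * gate[:, None], idx

def gunpool(X_pooled, idx, num_nodes):
    """Place pooled features back at their original positions; other rows stay zero."""
    X_full = np.zeros((num_nodes, X_pooled.shape[1]))
    X_full[idx] = X_pooled
    return X_full

# Path graph 0 - 1 - 2 - 3 with 2 features per node.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
X = np.array([[1., 0.], [0., 2.], [3., 0.], [0., 0.5]])
p = np.array([1., 1.])                           # hypothetical projection vector
A_p, X_p, idx = gpool(A, X, p, ratio=0.5)        # keeps nodes 1 and 2
X_up = gunpool(X_p, idx, num_nodes=4)
```

In a trainable model the projection vector `p` would be a learned parameter; multiplying by the gate is what lets gradients reach it.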
The pooling operation for downsampling and the unpooling operation for upsampling are the key to the U-Net architecture. Based on the gPool and gUnpool layers mentioned above, Gao and Ji developed the graph U-Net architecture [30], which allows high-level feature encoding and decoding for network embedding. In the encoder part, each encoding block contains a gPool layer followed by a GCN layer. The gPool layers reduce the size of the feature map to encode high-order features for the coarsened graph, while the GCN layers pass and aggregate information from the first-order neighborhood of each node. The decoder part has the same number of decoding blocks as the encoder part. Each decoding block consists of a gUnpool layer and a GCN layer, in which the gUnpool layer restores the graph to its original structure with higher resolution. In addition, there are skip-connections, similar to residual connections, between corresponding blocks of the encoder and decoder parts, which transmit spatial information from the encoding blocks to the decoding blocks for better performance. Figure 2 provides the architecture of graph U-Net and illustrations of the gPool/gUnpool layers. Moreover, graph U-Net uses two tricks to improve performance. The first is to augment the connectivity of the graph with the graph power operation: since some less important nodes and their edges are removed by the gPool layers, nodes in the coarsened graph might become isolated, which may hinder information propagation and aggregation in subsequent GCN layers. Graph U-Net uses the second graph power $A_\ell^2 = A_\ell A_\ell$ to build links between nodes that are at most 2 hops apart, and selects the nodes of the next layer as $A_{\ell+1} = A_\ell^2(\mathrm{idx}, \mathrm{idx})$ according to the TopK ranking result of the gPool layer, as shown in Fig. 2b. The second is an improved GCN layer, in which the adjacency matrix before normalization is computed as $\hat{A} = A + 2I$, imposing larger weights on self-loops.
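The effect of the graph power trick can be seen on a three-node example. The sketch below is only illustrative; the helper name `pooled_adjacency` and the toy graph are assumptions of this example.

```python
import numpy as np

def pooled_adjacency(A, idx):
    """Link nodes joined by a walk of length 2 via A @ A, then keep the selected nodes."""
    A2 = A @ A                       # entry (i, j) counts length-2 walks from i to j
    return A2[np.ix_(idx, idx)]

# Path graph 0 - 1 - 2: dropping the middle node would isolate nodes 0 and 2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
idx = np.array([0, 2])               # nodes kept by a hypothetical gPool step
plain = A[np.ix_(idx, idx)]          # without graph power: no edges remain
powered = pooled_adjacency(A, idx)   # with graph power: 0 and 2 are linked
```

Without the power step the coarsened graph has no edges at all, so a following aggregation layer could not exchange information between the two remaining nodes.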
Although graph U-Net achieves an encoder-decoder architecture with two small tricks on graph data, it still has two limitations. First, the optimal depth of an encoder-decoder network varies across applications, depending on the difficulty and complexity of the task. This uncertainty is inefficient from a deployment perspective, because networks of different depths do not share a common encoder. Second, the design of skip connections in an encoder-decoder network is unnecessarily restrictive: only same-scale feature maps from the encoder and decoder can be fused, even though these same-scale feature maps are semantically dissimilar. Thus, there is a need for a redesigned skip connection scheme that presents feature maps of varying scales at a decoder node and allows the aggregation layer to decide how the feature maps carried along the skip connections should be fused with the decoder feature maps, such as dense connections. In addition to the classification tasks described above, graph representation can also be applied to social network analysis, community detection, and recommendation tasks [38,39]. A community reveals characteristics of its members that differ from those of other communities in the network, and the detection of connected communities is of great significance in network analysis. Applications of community detection usually include community deception, community search, recommendation systems, and online social network analysis.

Fig. 2 An illustration of the graph U-Net architecture and its encoder-decoder parts. a In this example, each node in the input graph has 2 features for the node classification task. The network has 2 encoding/decoding blocks and 2 skip-connections. b The graph pooling layer for a graph with 4 nodes, each with 5 features. After pooling with a ratio of 0.5, the coarsened graph has 2 nodes. c A graph with 7 nodes is down-sampled, resulting in a coarsened graph with 4 nodes and the position information of the selected nodes. The corresponding gUnpool layer uses the position information to reconstruct the original graph structure by assigning empty feature vectors to the unselected nodes.

Graph U-Net+
In this section, we take graph U-Net as a reference and propose the graph U-Net+ architecture for the social behavior prediction (binary node classification) task. Based on this novel U-Net-like architecture, we introduce several improvements, including improved encoding/decoding blocks, a better skip-connection scheme, and a deep supervision mechanism.

Graph U-Net+ architecture
It is well known that encoder-decoder networks like graph U-Net have achieved promising performance on node-wise prediction tasks, since they can encode and decode high-level features while maintaining local spatial information. The down-sampling operation gPool is adopted to increase the robustness of the model to perturbations such as subtle changes in the graph structure and signal noise, and to reduce the risk of over-fitting. The size of the graph is also compressed to reduce computation. In this way, graph U-Net can capture not only the features of shallow layers but also the deep-layer features produced by graph convolution and the gradual expansion of the receptive field along the downsampling path.
To make full use of the respective advantages of graph U-Net networks of different depths in capturing features at different levels, we stack graph U-Net networks with depths 1 to 4 together while keeping the gPool and gUnpool layers. Compared with the original graph U-Net, the main body of this network fills in the originally hollow part of the U-structure. The architecture exploits the capability of sub-networks of different depths and learns the important features at each depth. More importantly, the graph U-Net networks within this architecture share the same feature extraction encoder, so the encoding blocks are trained only once along a single downsampling path, and the features at different levels are then restored by different up-sampling paths, which greatly reduces the number of trainable parameters. In this sense, the architecture is not simply a stack of independent graph U-Nets. However, if the original long skip-connection scheme of graph U-Net is left unchanged, such an architecture cannot be trained: except for the deepest decoder, the shallower decoders are disconnected from the cross-entropy loss during back propagation, so no gradient information is propagated through this region (the red area in Fig. 3).
The skip-connection in graph U-Net is similar to a residual connection, so, following the experience of DenseNet [40] in improving the classification performance of ResNet [41], we introduce the classic dense connection. Thus, our graph U-Net+ has both long and short connections, which increase the reuse of multi-scale and high-order features and ensure that the network is trainable. In addition, we introduce a deep supervision mechanism, which not only improves the performance of graph U-Net+ but also solves the problem that the gradient cannot propagate back. Specifically, we add a linear layer after the last decoding block of each decoder and let its output participate directly in the calculation of the loss function, which is equivalent to supervising the output of the sub graph U-Net of each branch. The deep supervision mechanism integrates the multi-level outputs of graph U-Net+ to obtain better node classification performance. All four outputs participate in the training stage. In the testing stage, the network has already been trained and only the forward pass is needed, so we can compare which of the four outputs performs best and prune the other "branches". This makes it convenient to select a model of appropriate size from the pre-trained network according to the requirements of the actual task, which improves the scalability of graph U-Net+ in different environments. Figure 4 shows the complete graph U-Net+ architecture with the designed skip-connections and deep supervision mechanism.
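The deep supervision idea can be sketched as follows. This is a hedged illustration only: the text does not state how the branch losses are weighted, so the sketch assumes they are summed with equal weights, and the helper names (`cross_entropy`, `deep_supervision_loss`, `best_branch`) and toy logits are hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of (n, 2) logits against integer labels in {0, 1}."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def deep_supervision_loss(branch_logits, labels):
    """Sum the loss of every branch so that each decoder receives a gradient signal.

    Equal weighting is an assumption of this sketch.
    """
    return sum(cross_entropy(z, labels) for z in branch_logits)

def best_branch(branch_logits, labels):
    """Index of the branch with the highest accuracy; the others could be pruned."""
    acc = [(z.argmax(axis=1) == labels).mean() for z in branch_logits]
    return int(np.argmax(acc))

labels = np.array([0, 1, 1])
branch_a = np.zeros((3, 2))                                   # uninformative branch
branch_b = np.array([[2., 0.], [0., 2.], [0., 2.]])           # accurate branch
loss = deep_supervision_loss([branch_a, branch_b], labels)
keep = best_branch([branch_a, branch_b], labels)
```

During training every branch contributes to `loss`; at test time only the branch selected by `best_branch` (evaluated on held-out data in practice) needs to be kept.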

Improved information aggregation layer BGNN
We introduce the GCNII layer and the bilinear information aggregator layer, and combine them into the improved information aggregation layer of the encoding/decoding blocks of graph U-Net+. We replace the GCN layer of graph U-Net with the GCNII layer [31], whose forward propagation is defined as:

$$X_{\ell+1} = \sigma\left(\left((1-\alpha_\ell)\tilde{P}X_\ell + \alpha_\ell X_0\right)\left((1-\beta_\ell) I_N + \beta_\ell W_\ell\right)\right)$$

where $\tilde{P} = \hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}$ represents the symmetrically normalized adjacency matrix with self-loops; $\hat{A} = A + 2I$ is retained to highlight the features of the node itself during information propagation and aggregation; $I_N$ is the identity matrix, where $N$ denotes the number of nodes; $W_\ell$ is the learnable parameter matrix; and $\alpha_\ell$ and $\beta_\ell$ are the hyperparameters that control the initial residual connection and the proportion of identity mapping, respectively. $\alpha_\ell \in [0, 1]$ means that each node output by the GCNII layer contains at least a part of the input feature $X_0$, where $X_0$ refers to the result after dimension reduction through the input linear layer. The principle for setting $\beta_\ell$ is to ensure that the attenuation rate of the weight matrix increases adaptively as the network stacks more layers. In practice, $\beta_\ell = \log\left(\frac{\lambda}{\ell} + 1\right) \approx \frac{\lambda}{\ell}$, where $\lambda$ is a hyperparameter. Compared with the GCN layer in graph U-Net, the GCNII layer can alleviate the over-smoothing problem caused by deepening the graph network, and thus improves the representation learning of nodes. To mine and utilize the interaction information between neighboring nodes for better representation, the bilinear information aggregator layer [32] is introduced in this paper. A common GCN or GCNII layer boils down to a weighted summation of information by a node over its neighbors, which may lose some properties that the nodes have in common. The bilinear information aggregator captures and strengthens such weak signals, for example the interaction information between neighbors or the common characteristics of friends, through an element-wise product operation.
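A minimal NumPy sketch of one GCNII propagation step, as described above, is given below. Several details are assumptions of this sketch rather than facts from the paper: the function name `gcnii_layer`, the default values of `alpha` and `lam`, the use of ReLU instead of Mish, and the identity matrix being sized to match the weight matrix so that the matrix product is well defined.

```python
import numpy as np

def gcnii_layer(A, X, X0, W, layer, alpha=0.2, lam=0.5):
    """One GCNII step with initial residual (alpha) and identity mapping (beta)."""
    A_hat = A + 2.0 * np.eye(A.shape[0])                 # A + 2I, heavier self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    P = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    beta = np.log(lam / layer + 1.0)                     # decays as the layer index grows
    support = (1.0 - alpha) * (P @ X) + alpha * X0       # mix in the initial features
    # Identity sized like W (feature dimension): an assumption of this sketch.
    mapping = (1.0 - beta) * np.eye(W.shape[0]) + beta * W
    return np.maximum(support @ mapping, 0.0)            # ReLU used here for simplicity

A = np.array([[0., 1.],
              [1., 0.]])
X0 = np.array([[1., 2.],
               [3., 4.]])
X1 = gcnii_layer(A, X0, X0, np.eye(2), layer=1)
```

With `alpha = 1` the layer ignores the propagated features and returns a transformation of `X0` alone, which shows how the initial residual keeps part of the input in every layer.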
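The forward propagation formula of the bilinear information aggregator is defined as:

$$\mathrm{BA}(v_i) = \frac{1}{b_i} \sum_{p \in \tilde{N}(v_i)} \; \sum_{q \in \tilde{N}(v_i),\, p < q} (x_p W) \odot (x_q W)$$

where $\odot$ is the element-wise product, $\tilde{N}(v_i)$ is the neighborhood of node $v_i$ including $v_i$ itself, and $b_i = \frac{1}{2}\tilde{d}_i(\tilde{d}_i - 1)$, with $\tilde{d}_i = |\tilde{N}(v_i)|$, represents the number of interactions of node $v_i$, so that the obtained node representation is normalized to eliminate the deviation caused by node degree. The bilinear information aggregator emphasizes the interaction information between neighboring nodes, i.e. their common features, while it dilutes and weakens the information in which they differ.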
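The pairwise interaction described above can be computed without an explicit double loop, using the identity that the sum of products over all pairs equals half of the squared sum minus the sum of squares. The sketch below follows the bilinear aggregator of [32] as we understand it: for each node, the transformed features of all pairs in its neighborhood (including the node itself) are multiplied element-wise and averaged over the number of pairs. The name `bilinear_aggregator` and the toy graph are assumptions of this example.

```python
import numpy as np

def bilinear_aggregator(A, X, W):
    """Average element-wise product over all node pairs in each extended neighborhood."""
    n = A.shape[0]
    A_tilde = ((A + np.eye(n)) > 0).astype(float)   # neighborhoods including the node itself
    Z = X @ W                                       # linear transformation of features
    sum_z = A_tilde @ Z                             # sum of z_p over the neighborhood
    sum_z2 = A_tilde @ (Z * Z)                      # sum of z_p^2 over the neighborhood
    pair_sum = 0.5 * (sum_z ** 2 - sum_z2)          # sum over pairs p < q of z_p * z_q
    deg = A_tilde.sum(axis=1)
    b = 0.5 * deg * (deg - 1.0)                     # number of pairs per node
    b = np.where(b > 0, b, 1.0)                     # isolated node: pair_sum is already 0
    return pair_sum / b[:, None]

# Path graph 0 - 1 - 2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
out = bilinear_aggregator(A, X, np.eye(2))
```

For node 1, whose extended neighborhood is {0, 1, 2}, the three pairwise products are averaged, so large values appear only in dimensions where several neighbors are simultaneously active.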
In this paper, we combine GCNII and the bilinear information aggregator and construct a more powerful information aggregation layer, BGNN:

$$\mathrm{BGNN}(X_\ell) = (1-\gamma)\,\mathrm{GCNII}(X_\ell) + \gamma\,\mathrm{BA}(X_\ell)$$

where the hyperparameter $\gamma \in [0, 1]$ controls the aggregation ratio of the outputs produced by the two information aggregators mentioned above. When $\gamma$ is 0, graph U-Net+ does not consider the interaction information between nodes, and the information aggregation layer degenerates to a single GCNII layer. When $\gamma$ is 1, the information aggregation layer only uses the bilinear information aggregator.

Encoding and decoding blocks
It is well known that batch normalization is added to CNN-based networks in image processing to prevent the over-fitting problem. In our proposed graph U-Net+, following the information aggregation layer BGNN, we introduce the graph node normalization layer NodeNorm [33] into the down/up-sampling blocks to alleviate the over-smoothing problem and improve node classification performance. The NodeNorm layer defines a node-level deviation that measures the feature variance of a single node. Taking node $v_i$ as an example, its feature variance at layer $\ell$ is:

$$\mathrm{var}_i = \frac{1}{d}\sum_{j=1}^{d}\left(x_{ij} - \mu_i\right)^2$$

where $x_{ij}$ represents the $j$th feature in the representation vector of node $v_i$; $d$ denotes the dimension of the representation vector of node $v_i$; and $\mu_i = \frac{1}{d}\sum_{j=1}^{d} x_{ij}$ is the mean of all features of node $v_i$. We hope that in each encoding and decoding block, after the information aggregation layer, all nodes can find a suitable representation in a latent embedding space. Through centering and then rescaling, the new embedding representation of node $v_i$ is defined as:

$$\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i}$$

where $\sigma_i = \sqrt{\mathrm{var}_i}$ is the standard deviation of node $v_i$. In the node classification task, nodes with similar features often belong to the same category. NodeNorm brings such nodes closer together in the latent space, so that the distance between similar nodes is smaller and the classification result is better.
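The per-node centering and rescaling described above can be sketched in a few lines of NumPy. The function name `node_norm` and the small constant `eps`, added to avoid division by zero for constant feature vectors, are assumptions of this sketch.

```python
import numpy as np

def node_norm(X, eps=1e-6):
    """Normalize each node's feature vector to zero mean and unit variance."""
    mu = X.mean(axis=1, keepdims=True)       # mean over the d features of each node
    var = X.var(axis=1, keepdims=True)       # population variance (divides by d)
    return (X - mu) / np.sqrt(var + eps)     # eps is a numerical safeguard

# Two nodes whose features differ only in scale.
X = np.array([[1., 2., 3.],
              [10., 20., 30.]])
X_norm = node_norm(X)
```

Both rows map to almost the same normalized vector, which illustrates how nodes with the same feature pattern but different magnitudes end up close together in the latent space.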
Based on gPool, gUnpool, the information aggregation layer BGNN, and the graph node normalization layer NodeNorm, we propose the encoding/decoding blocks of graph U-Net+. As shown in Fig. 5, an encoding block contains a gPool layer followed by a BGNN layer and a NodeNorm layer. The gPool layer reduces the graph size, the BGNN layer aggregates information from the neighborhood of each node, and the NodeNorm layer computes a new, appropriate representation for the nodes in the latent embedding space to avoid the over-smoothing problem. Each decoding block is composed of a BGNN layer, a NodeNorm layer, and a gUnpool layer. The pseudocode of the complete procedure (Algorithm 1) is given in the Appendix; it describes the training, validation, and testing process of the network.

Experimental settings
In this section, we give the details of the datasets and evaluation experiments. We set up our experiments with large-scale real-world datasets to quantitatively evaluate the proposed graph U-Net+ model. We first introduce the four datasets and then the details of the input data. After that, we introduce the evaluation metrics and the experimental settings.

Datasets
Our experiments are conducted on the following social media and co-author citation networks: Digg, Twitter, Weibo, and Open-Academic-Graph (OAG). Table 1 lists all statistics of four datasets mentioned above. In this paper, we only consider the undirected relationship (undirected graph) situation.
Digg [42]: Digg is a website that collects articles and stories. It allows users to vote on content (that is, each story or article) according to personal preference and the opinions of their online friends, and the stories that receive more votes are recommended to the website's home page. This dataset contains the voting history for stories that were promoted to the Digg home page during one month in 2009. For each story, it contains a list of users who had voted on the story up to the data collection time, together with the UNIX timestamp of each vote, and the friend relationships between voters are also retrieved. The Digg relationship network has 279,630 nodes and 1,548,126 edges. The social behavior is defined […]

Twitter [43]: On July 4, 2012, scientists announced on Twitter that they had discovered a new particle with the features of the elusive Higgs boson. The Stanford team monitored and recorded the spread of this news through the Twitter platform. This dataset records the retweeting history of this news among users, including the users' friend relationship network, which has 456,626 nodes and 12,508,418 edges. The social behavior is defined as whether the user was involved in retweeting this news. This dataset specifies 362,888 users as target ego-users.

Weibo [44,45]: […]

OAG: The OAG (Open Academic Graph) dataset is generated by connecting two large academic graph networks, MAG (Microsoft Academic Graph) [46] and AMiner [47], and includes 20 popular conferences from the data mining, information retrieval, machine learning, natural language processing, computer vision, and database research communities. The social network is defined to be the co-author citation network, in which 953,675 nodes represent scholars and 4,151,463 edges represent collaborative relationships among them. The social behavior is defined to be a citation action: a researcher cites a paper from the above conferences. For the entire OAG network, the dataset designates 499,848 nodes as target ego-users. We want to explore how a researcher's citation behavior is influenced by his or her collaborators.
In this paper, the proportions of the above datasets in training/validation/testing sets are 75% , 12.5% and 12.5% respectively. Table 2 shows the number of target ego-users after specific division.

Data preparation
Each dataset mentioned above consists of two main parts: an adjacency matrix and a node representation matrix. The preprocessing is described below. Adjacency matrix: As shown in Table 1, each dataset has N target ego-users. Because our goal is to predict the future behavior of these target ego-users, we focus on each target node and its local neighborhood. For one target ego-user, the random walk with restart (RWR) algorithm starts from this node and iteratively travels to a neighbor with probability proportional to the weight of each edge. In addition, at each step, the walk returns to the target ego-user or one of its active neighbors with probability 20%. The RWR runs until it has collected 50 nodes from the social network, including the target ego-user. After sampling, the entire social media network is turned into N subgraphs. Each subgraph is an undirected graph in which every edge has weight 1.
Node representation matrix: The size of this matrix is (N × 50 × 73), in which each row is the 73-dimensional feature vector of one node. Figure 6 shows the components of the node representation features. The first (yellow) part is obtained by the network embedding algorithm DeepWalk [19]. The second (green) part comes from the characteristics of the network: these social influence metrics measure the impact of nodes and objectively reflect the importance of nodes in the network, the structural information of the network, and some attributes of the nodes themselves. The last (blue) part contains two hand-crafted features: one indicates whether the user is the ego-user (0/1), and the other indicates the behavior status (0/1) of the user.
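The random walk with restart sampling used to build each 50-node subgraph can be sketched as follows. This is a simplified illustration under stated assumptions: the graph is unweighted, so the next neighbor is chosen uniformly; the walk restarts only at the ego-user (the active-neighbor restart is omitted); and the function name `rwr_sample`, the step cap `max_steps`, and the toy ring-like graph are hypothetical.

```python
import random

def rwr_sample(adj, ego, size=50, restart_prob=0.2, max_steps=100_000, seed=42):
    """Collect up to `size` distinct nodes around `ego` by random walk with restart."""
    rng = random.Random(seed)
    sampled = {ego}
    current = ego
    for _ in range(max_steps):                  # cap guards against tiny components
        if len(sampled) >= size:
            break
        if rng.random() < restart_prob or not adj[current]:
            current = ego                       # restart at the ego-user
            continue
        current = rng.choice(adj[current])      # uniform step: all edge weights are 1
        sampled.add(current)
    return sampled

# Toy graph: 100 nodes on a ring, each linked to its 10 nearest nodes on either side.
n = 100
adj = {i: [(i + k) % n for k in range(-10, 11) if k != 0] for i in range(n)}
subgraph_nodes = rwr_sample(adj, ego=0)
```

The induced subgraph on `subgraph_nodes` would then provide the 50 × 50 adjacency matrix for one target ego-user.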

Evaluation metrics and baselines
Evaluation metrics: To evaluate the performance of our proposed graph U-Net+ model on the social media user behavior prediction task, we use four standard classification metrics. Precision (Prec.) is the fraction of samples predicted as positive that are actually positive. Recall (Rec.) is the fraction of actual positive samples that are predicted as positive. F1 (F1-score), commonly used in binary classification, is the harmonic mean of precision and recall. AUC (Area Under Curve) is also a standard evaluation metric in binary classification: it is the area under the ROC curve, which is plotted with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis.
Baselines: We compare our proposed graph U-Net+ approach with five other state-of-the-art approaches, including GCN-based and GAT-based DeepInf [17,23,25], PPNP and APPNP [26,27], and graph U-Net [30]. The results of GCN-based and GAT-based DeepInf, PPNP, and APPNP are directly cited from existing papers [25,26]. We additionally reproduce and test the graph U-Net model on all datasets, and use the results for performance comparison.
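For reference, the four evaluation metrics defined above can be computed from labels and predicted scores as in the following plain-Python sketch. The function name `binary_metrics`, the 0.5 decision threshold, and the toy data are assumptions of this example; AUC is computed through its pairwise-ranking interpretation, which is equivalent to the area under the ROC curve.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Return (precision, recall, F1, AUC) for binary labels and real-valued scores."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # AUC: probability that a random positive is scored above a random negative.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if pos and neg:
        wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
        auc = wins / (len(pos) * len(neg))
    else:
        auc = float("nan")                      # undefined with a single class
    return precision, recall, f1, auc

y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.1]
precision, recall, f1, auc = binary_metrics(y_true, y_score)
```

Note that precision, recall, and F1 depend on the chosen threshold, whereas AUC depends only on the ranking of the scores.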

Hyper-parameter settings and implementation details
We build a graph U-Net+ framework with a depth of 4 that contains all modules described in Sect. 3, and test its performance on four datasets. Each node has an initial feature dimension of 73, so we add a linear layer of size 73 × 64 as the input layer (grey block). The parameter matrix used for the feature transformation of the information aggregation layer in each downsampling encoder module has size 64 × 64. Correspondingly, in the upsampling decoder module, the parameter matrix is also of size 64 × 64, since the skip connections use summation. Before the output is sent to the cross-entropy loss function, we add another linear layer as the output layer. Since the task is a binary classification problem, the size of this linear layer is 64 × 2 (grey block). For the detailed experiment configuration, we adopt Mish [48] as the nonlinear activation function. All parameters are initialized with Glorot initialization [49] and trained by the Adam optimizer [50] with learning rate 0.01, weight decay 5e-4, dropout rate 0.2, 0.2 and 0.5 for GCNII layer, pooling ratio 0.5 for graph pooling layer, 0.3 for each information aggregation layer of encoding and decoding blocks. We set the random seed to 42. After the data are shuffled, we divide all subgraphs in each dataset into training, validation, and test sets in the proportions 75%, 12.5%, and 12.5%, respectively. We allow our model to run for at most 1000 epochs over the training data, and the best model is selected by an early stopping strategy with a patience of 100, based on the loss on the validation set. The mini-batch size is set to 1024 across all datasets.
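The early stopping rule described above (at most 1000 epochs, patience of 100 on the validation loss) can be sketched as follows. To keep the example self-contained, a precomputed list of validation losses stands in for the real training loop; the function name `select_best_epoch` and the toy loss values are hypothetical.

```python
def select_best_epoch(val_losses, patience=100, max_epochs=1000):
    """Return (best_epoch, best_loss) under early stopping on the validation loss."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:                           # no improvement for too long
                break
    return best_epoch, best_loss

# Toy run with a short patience so that the stopping rule is visible.
losses = [1.0, 0.8, 0.9, 0.95, 0.7]
stopped_early = select_best_epoch(losses, patience=2)
ran_to_end = select_best_epoch(losses, patience=3)
```

With a patience of 2 the loop stops before reaching the later improvement at epoch 4, which is why a generous patience such as 100 is used for noisy validation curves.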
All experiments are based on PyTorch 1.6.0 under a Linux Ubuntu 16.04 system environment. The hardware configuration of the experimental platform is an Intel® Xeon® E5-2680 v4 @ 2.40GHz CPU and an Nvidia GTX2080Ti GPU with 11019MiB memory. We also use the graph-based deep learning library PyTorch Geometric, which helps to improve performance and reduce running time.

Results and analysis
In this section, we compare our proposed graph U-Net+ with previous state-of-the-art models on the social behavior prediction (node classification) task. Experimental results show that our method achieves promising results on four common large real-world datasets and is able to overcome the over-smoothing problem. Ablation studies are performed to examine the contributions of the improvements proposed in Sect. 3.

Table note (all values in %): Bold values represent the best performance in each metric column. Our model clearly performs better than the other approaches in most situations. Both graph U-Net and graph U-Net+ have a depth of 4. '-' indicates that the cited experimental result is missing in paper [26].

Performance study
We conduct a verification experiment on model validity and present the AUC, precision, recall, and F1 results of graph U-Net+ and the compared approaches (GCN-based and GAT-based DeepInf, PPNP and APPNP, and graph U-Net) on four large real-world datasets in Table 3. Both graph U-Net and graph U-Net+ have a depth of 4. The results show that our proposed graph U-Net+ achieves better performance than the other networks. In particular, on the two key metrics AUC and F1, graph U-Net+ consistently outperforms the other methods. In this experiment, except for the result of graph U-Net, which we reproduced ourselves, the results are directly cited from two papers [25,26]. Our proposed model is composed of GCNII and the bilinear information aggregator module without involving more advanced graph convolution layers such as GAT, yet the AUC and F1 metrics on the four datasets show that graph U-Net+ performs better than the graph self-attention based DeepInf-GAT. We attribute this to the encoder-decoder architecture, which performs well in pixel-level or node-level classification. Similarly, compared with the GCN-based methods DeepInf-GCN, APPNP, and graph U-Net, our graph U-Net+ benefits not only from the more advanced network architecture but also from the improved graph convolution approach in the encoding and decoding blocks. The additional information aggregation mechanism makes the fusion of node information more reasonable and robust. On the other hand, the graph normalization technique also helps to suppress over-smoothing. As for the weaker precision and recall on the OAG dataset, we believe the reason may be that the other three datasets are all standard social networks based on real-world media, in which user interactions and friend relationships are closer to the research interest of this paper.
The OAG dataset is a "social network"-like dataset extracted from citation and co-author networks, whose structure is quite different, so the precision and recall metrics did not outperform other methods, but the more convincing AUC and F1 metrics support the effectiveness of our method. Moreover, in graph node classification tasks, researchers mainly focus on the AUC and F1 metrics. F1 takes into account both the precision and recall of the classification model and is the harmonic mean of the two. Therefore, higher F1 values represent better model performance. In particular, the architecture design and the use of the deep supervision mechanism lead to a marked improvement over graph U-Net. When compared with graph U-Net directly on AUC, our graph U-Net+ improves performance on all four datasets by margins of 4.11%, 4.33%, 6.96%, and 3.29%, respectively. These results demonstrate the effectiveness of graph U-Net+ in node classification and social behavior prediction tasks.

Ablation study
In this section, we investigate the contributions of five improvements to the performance of graph U-Net+. To study the effectiveness of the proposed network architecture, we conduct an experiment that keeps all GCN, gPool, and gUnpool layers within graph U-Net+, so that the only difference between graph U-Net+ and graph U-Net is the network structure. The results in Table 4 show that graph U-Net+ is more effective than graph U-Net, i.e. our architecture is a better version of the encoder-decoder architecture. In terms of architecture, graph U-Net+ enables higher-level feature encoding and thus better performance, owing to the deep supervision mechanism and the dense-like skip connections. Building on the experiment in Table 4, Table 5 compares graph U-Net+ with GCN layers against graph U-Net+ with GCNII layers. We keep the gPool and gUnpool layers unchanged and only replace the GCN layer with the GCNII layer. The results show that the graph U-Net+ architecture with GCNII outperforms graph U-Net+ with GCN by average margins of 1.54% on AUC and 1.50% on F1 across all datasets. Note that the graph U-Net+ architecture in Table 5 does not contain the bilinear information aggregator module or the NodeNorm layer.
The following two experiments of Tables 6 and 7 are designed to verify the functional validity of the bilinear information aggregator and NodeNorm layer we introduced in graph U-Net+. The results show that these two layers with GCNII module are helpful to improve the prediction performance and reduce the effect of over-smoothing problem.
In the last experiment, we verify the effectiveness of the deep supervision mechanism added to graph U-Net+ by removing the other "branch" outputs connected to the loss function. We only keep the output of the deepest decoder, because it carries much more information contributed by the encoding blocks through the skip-connections. The "without Deep Supervision" version means that graph U-Net+ only contains the output after the linear layer of the decoder with the longest path. The results in Table 8 show that graph U-Net+ with the deep supervision mechanism performs better by average margins of 0.81% on AUC and 0.46% on F1 across all datasets. Overall, each of the examined components yields a performance improvement.

Conclusion
In this work, we study the social media user behavior prediction problem. This paper explores further possibilities for behavior prediction models and puts forward the graph U-Net+ model, which builds on graph U-Net with a more complex structure but better performance. We retain both the gPool and gUnpool operations from graph U-Net. Our graph U-Net+ architecture can be thought of as a stacked version of graph U-Nets of different depths, with the skip connections changed from residual to dense. These two changes help the model absorb more features while restoring and retaining as much of the original and high-level information as possible. To make the model trainable, we also add a deep supervision mechanism to the final loss function. Furthermore, we expand and enrich the internal structure of the encoder and decoder. The original graph U-Net contains only a single graph convolution in both encoder and decoder blocks, whereas in graph U-Net+ we replace it with a combination of GCNII and bilinear information aggregator modules to enrich the learning capability of the node representation. The graph normalization technique is also introduced into our encoder and decoder to help the model overcome the over-smoothing problem. We test our model on four social media and co-author citation networks: Digg, Twitter, Weibo, and OAG. Experimental results demonstrate that our proposed graph U-Net+ achieves encouraging performance and outperforms previous methods.