We are interested in measuring two main effects: first, the variation in model performance as we increase the overlap in the embeddings, and second, the capacity of the embeddings with overlap (versus no overlap) to capture and benefit from dependencies between graph structure and node features. To that end, we train overlapping and non-overlapping models on synthetic data with different degrees of correlation between network structure and node features.
Synthetic featured networks
We use a Stochastic Block Model (Holland et al. 1983) to generate synthetic featured networks, each with M communities of n=10 nodes, with an intra-cluster connection probability of 0.25 and an inter-cluster connection probability of 0.01. Each node is initially assigned a colour which encodes its feature community; we then shuffle the colours of a randomly sampled fraction 1−α of the nodes. This procedure keeps the overall count of each colour constant, and lets us control the correlation between graph structure and node features by varying α from 0 (no correlation) to 1 (full correlation).
Node features are represented by a one-hot encoding of each node's colour (therefore, in all our scenarios, the node features have dimension M=N/n). However, since all the nodes inside a community then have exactly the same feature value, the model may have difficulty distinguishing nodes from one another. We therefore add a small Gaussian noise (σ = 0.1) to make sure that nodes in the same community can be distinguished from one another.
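For concreteness, the following sketch generates one such synthetic featured network (in Python with NumPy; this is an illustration of the procedure above, not our actual generation code, and the function and parameter names are ours):

```python
import numpy as np

def make_featured_sbm(M=100, n=10, p_in=0.25, p_out=0.01, alpha=0.8, sigma=0.1, seed=None):
    """Generate one synthetic featured network as described above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N = M * n
    communities = np.repeat(np.arange(M), n)              # structural block of each node
    # SBM adjacency: intra-block edges with probability p_in, inter-block with p_out
    probs = np.where(communities[:, None] == communities[None, :], p_in, p_out)
    A = (rng.random((N, N)) < probs).astype(int)
    A = np.triu(A, 1)
    A = A + A.T                                            # symmetric, no self-loops
    # Colours start equal to the structural community; a fraction 1 - alpha of nodes,
    # sampled at random, have their colours shuffled (overall colour counts preserved)
    colours = communities.copy()
    shuffled = rng.choice(N, size=int(round((1 - alpha) * N)), replace=False)
    colours[shuffled] = rng.permutation(colours[shuffled])
    # One-hot features of dimension M, plus small Gaussian noise so that nodes
    # sharing a colour remain distinguishable
    X = np.eye(M)[colours] + rng.normal(0.0, sigma, size=(N, M))
    return A, X, communities, colours
```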
Note that the feature matrix has fewer degrees of freedom than the adjacency matrix in this setup, a fact that will be reflected in the plots below. However, opting for this minimal generative model lets us avoid the parameter exploration of more complex schemes for feature generation, while still demonstrating the effectiveness of our model.
Comparison setup
To evaluate how efficiently our model captures meaningful correlations between network and features, we compare overlapping and non-overlapping models as follows. For a given maximum number of embedding dimensions Fmax, the overlapping models keep constant the number of dimensions used for adjacency matrix reconstruction and the number used for feature reconstruction, with the same amount allocated to each task: \(F^{ov}_{\mathbf {A}} + F^{ov}_{\mathbf {AX}} = F^{ov}_{\mathbf {X}} + F^{ov}_{\mathbf {AX}} = \frac {1}{2} F_{max}\). However, they vary the overlap \(F^{ov}_{\mathbf {AX}}\) from 0 to \(\frac {1}{2} F_{max}\) in steps of 2. Thus the total number of embedding dimensions F varies from Fmax down to \(\frac {1}{2} F_{max}\), and as F decreases, \(F^{ov}_{\mathbf {AX}}\) increases. We call one such model \(\mathcal {M}^{ov}_{F}\).
Now, for a given overlapping model \(\mathcal {M}^{ov}_{F}\), we define a reference model \(\mathcal {M}^{ref}_{F}\) which has the same total number of embedding dimensions but no overlap: \(F^{ref}_{\mathbf {AX}} = 0\) and \(F^{ref}_{\mathbf {A}} = F^{ref}_{\mathbf {X}} = \frac {1}{2} F\) (which is why we vary F in steps of 2). Note that while the reference model has the same information bottleneck as the overlapping model, it has fewer trainable parameters in the decoder, since \(F^{ref}_{\mathbf {A}} + F^{ref}_{\mathbf {AX}} = F^{ref}_{\mathbf {X}} + F^{ref}_{\mathbf {AX}} = \frac {1}{2} F\) decreases as F decreases. Nevertheless, this is not a problem for our measures, since we mainly examine the behaviour of a given model across different values of α (i.e. the feature-network correlation parameter).
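The dimension bookkeeping for one sweep over overlaps can be summarised as follows (an illustrative sketch; the function name is ours):

```python
def embedding_dims(F_max=20):
    """Dimension bookkeeping for the overlapping models M^ov_F and their
    non-overlapping references M^ref_F (illustrative sketch of the scheme above)."""
    for F_AX in range(0, F_max // 2 + 1, 2):          # overlap grows in steps of 2
        F_A = F_X = F_max // 2 - F_AX                 # F_A + F_AX = F_X + F_AX = F_max / 2
        F = F_A + F_AX + F_X                          # total dimensions: F_max down to F_max / 2
        yield {
            "F": F,
            "overlap":   {"F_A": F_A,    "F_AX": F_AX, "F_X": F_X},
            "reference": {"F_A": F // 2, "F_AX": 0,    "F_X": F // 2},
        }
```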
For our calculations (unless noted otherwise) we use synthetic networks of N=1000 nodes (i.e. 100 clusters), and set the maximum number of embedding dimensions Fmax to 20. For all models, we set the intermediate layer in the encoder and the two intermediate layers in the decoder to an output dimension of 50, and the internal number of samples for loss estimation to K=5. We train our models for 1000 epochs using the Adam optimiser (Kingma and Ba 2014) with a learning rate of 0.01 (following Kipf and Welling 2016b), after initialising weights following Glorot and Bengio (2010). For each combination of F and α, the training of the overlapping and reference models is repeated 20 times on independently generated featured networks.
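As a hedged illustration of this training configuration (shown here with PyTorch, which may differ from our actual implementation; the function name is ours):

```python
import torch

def configure_training(model, lr=0.01):
    """Glorot (Xavier) weight initialisation and Adam optimiser with learning
    rate 0.01, as described above; `model` is any torch.nn.Module implementing
    the encoder and decoder (a sketch, not our original training code)."""
    for p in model.parameters():
        if p.dim() >= 2:                              # weight matrices only, not biases
            torch.nn.init.xavier_uniform_(p)
    return torch.optim.Adam(model.parameters(), lr=lr)
```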
Since the size of our synthetic data is constant, and we average training results over independently sampled data sets, we can meaningfully compare the averaged training losses of models with different parameters. We therefore take the average best training loss of a model as our main measure, as it indicates a model's capacity to reconstruct an input data set for a given information bottleneck and embedding overlap.
Advantages of overlap
Absolute loss values
Figure 2 shows the variation of the best training loss (total loss, adjacency reconstruction loss, and feature reconstruction loss) for both overlapping and reference models, with α ranging from 0 to 1 and F decreasing from 20 to 10 by steps of 2. One curve in these plots represents the variation in losses of a model with fixed F for data sets with increasing correlation between network and features; each point aggregates 20 independent trainings, used to bootstrap 95% confidence intervals.
We first see that all losses, whether for the overlapping or the reference models, decrease as we move from the uncorrelated to the correlated scenario. This is true despite the fact that the total loss is dominated by the adjacency reconstruction loss, feature reconstruction being an easier task overall. Second, recall that the decoder of a reference model has fewer parameters than that of the corresponding overlapping model with the same F (except at zero overlap), such that the reference is less powerful and produces higher training losses. The absolute loss values of overlapping and reference models are therefore not directly comparable. However, the changes in slope are meaningful. Indeed, the curve slopes are steeper for models with higher overlap (lower F) than for lower overlap (higher F), whereas for the reference models they appear largely independent of F. In other words, as we increase the overlap, our models seem to benefit more from an increase in network-feature correlation than a reference model does.
Relative loss disadvantage
In order to assess this trend more reliably, we examine losses relative to the maximum embedding models. Figure 3 plots the loss disadvantage that overlap and reference models have compared to their corresponding model with F=Fmax, that is, \(\frac {\mathcal {L}_{\mathcal {M}_{F}} - \mathcal {L}_{\mathcal {M}_{F_{max}}}}{\mathcal {L}_{\mathcal {M}_{F_{max}}}}\). We call this the relative loss disadvantage of a model. In this plot, the height of a curve thus represents the magnitude of the decrease in performance of a model \(\mathcal {M}^{ov|ref}_{F}\) relative to the model with maximum embedding size, \(\mathcal {M}^{ov|ref}_{F_{max}}\). Note that for both the overlap model and the reference model, moving along one of the curves does not change the number of trainable parameters in the model.
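In code, this quantity is simply (an illustrative sketch, with losses averaged over the repeated runs):

```python
import numpy as np

def relative_loss_disadvantage(losses_F, losses_Fmax):
    """Relative loss disadvantage of a model M_F with respect to M_Fmax,
    computed from the best training losses of the repeated runs (a sketch)."""
    L_F, L_Fmax = np.mean(losses_F), np.mean(losses_Fmax)
    return (L_F - L_Fmax) / L_Fmax
```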
As the correlation between network and features increases, we see that the relative loss disadvantage decreases in overlap models, and that the effect is stronger for higher overlaps. In other words, when the network and features are correlated, the overlap captures this joint information and compensates for the lower total number of dimensions (compared to \(\mathcal {M}^{ov|ref}_{F_{max}}\)): the model achieves a better performance than when network and features are more independent. Strikingly, for the reference model these curves are flat, thus indicating no variation in relative loss disadvantage with varying network-feature correlations in these cases. This confirms that the new measure successfully controls for the baseline decrease of absolute loss values when the network-features correlation increases, as observed in Fig. 2. Our architecture is therefore capable of capturing and taking advantage of some of the correlation by leveraging the overlap dimensions of the embeddings.
Finally, note that for high overlaps, the feature reconstruction loss actually increases slightly as α grows. This behaviour is consistent with the fact that the total loss is dominated by the adjacency matrix loss (the harder task). In this case, the total loss seems to improve more by prioritising adjacency matrix reconstruction, at the small cost of a poorer feature reconstruction, than by decreasing both the adjacency matrix and feature losses together. If desired, this trade-off could be controlled using a gradient normalisation scheme such as that of Chen et al. (2018).
Standard benchmarks
Finally, we compare the performance of our architecture to other well-known embedding methods, namely spectral clustering (SC) (Tang and Liu 2011), DeepWalk (DW) (Perozzi et al. 2014), the vanilla non-variational and variational Graph Autoencoders (GAE and VGAE) (Kipf and Welling 2016b), and GraphSAGE (Hamilton et al. 2017), which we examine in more detail. We do so on two tasks: (i) the link prediction task introduced by Kipf and Welling (2016b), and (ii) a node classification task, both on the Cora, CiteSeer and PubMed datasets, which are regularly used as citation network benchmarks in the literature (Sen et al. 2008; Namata et al. 2012). Note that neither SC nor DW supports feature information as an input.
The Cora and CiteSeer datasets are citation networks made of respectively 2708 and 3312 machine learning articles, each assigned to a small number of document classes (7 for Cora, 6 for CiteSeer), with a bag-of-words feature vector for each article (respectively 1433 and 3703 words). The PubMed network is made of 19717 diabetes-related articles from the PubMed database, each assigned to one of three classes, with article feature vectors containing term frequency-inverse document frequency (TF/IDF) scores for 500 words.
Link prediction
The link prediction task consists of training a model on a version of the datasets in which some of the edges have been removed, while node features are left intact. A test set is formed by randomly sampling 15% of the edges, combined with the same number of randomly sampled disconnected pairs (non-edges). The model is then trained on the remaining dataset, in which 15% of the real edges are missing.
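A sketch of this train/test split (in Python with NumPy; the names and details are illustrative, not our exact preprocessing code):

```python
import numpy as np

def split_edges(A, test_frac=0.15, seed=None):
    """Hold out `test_frac` of the edges, plus an equal number of randomly
    sampled disconnected pairs, as a test set; node features are left intact.
    Illustrative sketch of the protocol described above."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    edges = np.column_stack(np.triu(A, 1).nonzero())       # undirected edges, i < j
    n_test = int(round(test_frac * len(edges)))
    test_idx = rng.choice(len(edges), size=n_test, replace=False)
    test_edges = edges[test_idx]
    # Sample the same number of non-edges (disconnected node pairs)
    non_edges = set()
    while len(non_edges) < n_test:
        i, j = rng.integers(0, N, size=2)
        if i != j and A[i, j] == 0:
            non_edges.add((min(i, j), max(i, j)))
    # Training adjacency matrix with the held-out edges removed
    A_train = A.copy()
    A_train[test_edges[:, 0], test_edges[:, 1]] = 0
    A_train[test_edges[:, 1], test_edges[:, 0]] = 0
    return A_train, test_edges, np.array(sorted(non_edges))
```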
We pick hyperparameters such that the restriction of our model to VGAE matches the hyperparameters used by Kipf and Welling (2016b): that is, an output dimension of 32 for the intermediate layer in the encoder and the two intermediate layers in the decoder, and 16 embedding dimensions for each reconstruction task (FA+FAX=FX+FAX=16). We call the zero-overlap and full-overlap versions of this model AN2VEC-0 and AN2VEC-16 respectively. In addition, we test a variant of these models with a shallow adjacency matrix decoder, consisting of a direct inner product between node embeddings, while keeping the two dense layers for feature decoding. Formally: \(A_{ij} | \boldsymbol {\xi }_{i}, \boldsymbol {\xi }_{j} \sim \text {Ber}(\text {sigmoid}(\boldsymbol {\xi }^{T}_{i} \boldsymbol {\xi }_{j}))\). This modified overlapping architecture can be seen as simply adding the feature decoding and embedding overlap mechanics to the vanilla VGAE. Consistently, we call its zero-overlap and full-overlap versions AN2VEC-S-0 and AN2VEC-S-16.
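A minimal sketch of this shallow adjacency decoder (shown here in PyTorch for illustration; our actual implementation may differ):

```python
import torch

class ShallowAdjacencyDecoder(torch.nn.Module):
    """Shallow adjacency decoder of the AN2VEC-S-* variants: edge probabilities
    are the sigmoid of inner products between node embeddings (a sketch)."""
    def forward(self, xi):
        # xi: (N, F) matrix of node embeddings; returns the (N, N) matrix of
        # Bernoulli parameters sigmoid(xi_i . xi_j)
        return torch.sigmoid(xi @ xi.t())
```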
We follow the test procedure laid out by Kipf and Welling (2016b): we train for 200 epochs using the Adam optimiser (Kingma and Ba 2014) with a learning rate of 0.01, initialise weights following Glorot and Bengio (2010), and repeat each condition 10 times. The μ parameter of each node's embedding is then used for link prediction (i.e. the parameter is put through the decoder directly, without sampling), for which we report the area under the ROC curve and average precision scores in Table 1.
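For illustration, the held-out pairs can be scored from the embedding means using the shallow inner-product decoder as follows (a sketch using scikit-learn; the names are ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def link_prediction_scores(mu, test_edges, test_non_edges):
    """AUC and average precision on held-out edges and non-edges, scoring each
    pair by the sigmoid of the inner product of the embedding means `mu`."""
    pairs = np.vstack([test_edges, test_non_edges])
    labels = np.concatenate([np.ones(len(test_edges)), np.zeros(len(test_non_edges))])
    logits = np.sum(mu[pairs[:, 0]] * mu[pairs[:, 1]], axis=1)
    scores = 1.0 / (1.0 + np.exp(-logits))             # sigmoid of inner products
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```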
Table 1 Link prediction task in citation networks

We expected AN2VEC-0 and AN2VEC-16 to perform somewhat worse than VGAE, since these models must reconstruct an additional output which is not directly used for the link prediction task at hand; first results confirmed this intuition. However, we found that the shallow decoder models AN2VEC-S-0 and AN2VEC-S-16 perform consistently better than the vanilla VGAE for Cora and CiteSeer, while their deep counterparts (AN2VEC-0 and AN2VEC-16) outperform VGAE for all datasets. As neither AN2VEC-0 nor AN2VEC-16 exhibited over-fitting, this behaviour is surprising and calls for further exploration beyond the scope of this paper (in particular, it may be specific to the link prediction task). Nonetheless, the higher performance of AN2VEC-S-0 and AN2VEC-S-16 over the vanilla VGAE on Cora and CiteSeer confirms that including feature reconstruction in the constraints of node embeddings can increase link prediction performance when features and structure are not independent (consistent with Gao and Huang 2018; Shen et al. 2018; Tran 2018). An illustration of the embeddings produced by AN2VEC-S-16 on Cora is shown in Fig. 4.
On the other hand, the performance of AN2VEC-S-0 on PubMed is comparable with that of GAE and VGAE, while AN2VEC-S-16 performs slightly worse. The fact that lower-overlap models perform better on this dataset indicates that features and structure are less congruent here than in Cora or CiteSeer (again consistent with the comparisons found in Tran 2018). Despite this, an advantage of the embeddings produced by AN2VEC-S-16 is that they encode both the network structure and the node features, and can therefore be used for downstream tasks involving both types of information.
We further explore the behaviour of the model for different sizes of the training set, ranging from 10% to 90% of the edges in each dataset (with the test set adjusted accordingly), and compare the behaviour of AN2VEC to GraphSAGE. To make the comparison meaningful, we train two variants of the two-layer GraphSAGE model with mean aggregators and no bias vectors: one with an intermediate layer of 32 dimensions and an embedding layer of 16 dimensions (roughly equivalent in dimensions to the full-overlap AN2VEC models), and a second with an intermediate layer of 64 dimensions and an embedding layer of 32 dimensions (roughly equivalent to the no-overlap AN2VEC models). Both layers use neighbourhood sampling, with 10 neighbours for the first layer and 5 for the second. Similarly to the shallow AN2VEC decoder, each pair of node embeddings is reduced by an inner product and a sigmoid activation, yielding a scalar prediction between 0 and 1 for each possible edge. The model is trained on minibatches of 50 edges and non-edges (edges generated with random walks of length 5), with a learning rate of 0.001, for 4 epochs in total. Note that on Cora, one epoch represents about 542 minibatches, such that 4 epochs represent about 2166 gradient updates; thus, with a learning rate of 0.001, we remain comparable to the 200 full batches with learning rate 0.01 used to train AN2VEC.
Figure 5 plots the AUC produced by AN2VEC and GraphSAGE for different training set sizes and different embedding sizes (and overlaps, for AN2VEC), for each dataset. As expected, the performance of both models decreases as the size of the test set increases, though less so for AN2VEC. For Cora and CiteSeer, similarly to Table 1, higher overlaps and a shallow decoder in AN2VEC give better performance. Notably, the shallow decoder version of AN2VEC with full overlap still achieves around 0.75 for a test size of 90%, whereas both GraphSAGE variants are well below 0.65. For PubMed, as in Table 1, the behaviour differs from the first two datasets, with overlaps 0 and 16 yielding the best results. As with Cora and CiteSeer, the approach taken by AN2VEC gives good results: with a test size of 90%, all AN2VEC deep decoder variants remain above 0.75 (and the shallow decoders above 0.70), whereas both GraphSAGE variants are below 0.50.
Node classification
Since the embeddings produced also encode feature information, we then evaluate the model's performance on a node classification task. Here, the models are trained on a version of the dataset where a portion of the nodes (randomly selected) has been removed; next, a logistic classifier is trained on the embeddings to classify the training nodes into their classes; finally, embeddings are produced for the removed nodes, for which we report the F1 scores of the classifier.
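A sketch of this evaluation protocol (using scikit-learn for the logistic classifier and the F1 scores; the function and variable names are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def node_classification_f1(train_emb, train_labels, test_emb, test_labels):
    """Logistic-regression probe on node embeddings, reporting micro and macro
    F1 scores on the held-out nodes (a sketch of the protocol described above)."""
    clf = LogisticRegression(max_iter=1000).fit(train_emb, train_labels)
    pred = clf.predict(test_emb)
    return (f1_score(test_labels, pred, average="micro"),
            f1_score(test_labels, pred, average="macro"))
```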
Figure 6 shows the results for AN2VEC and GraphSAGE on all datasets. The scale of the reduction in performance as the test size increases is similar for both models (and similar to the behaviour for link prediction), though overlap and shallow versus deep decoding seem to have less effect. Still, the deep decoder is less affected by the change in test size than the shallow decoder; and contrary to the link prediction case, the zero-overlap models perform best (on all datasets). Overall, the performance levels of GraphSAGE and AN2VEC on this task are quite similar, with slightly better results for AN2VEC on Cora, slightly stronger performance for GraphSAGE on CiteSeer, and mixed behaviour on PubMed (AN2VEC is better for small test sizes and worse for large ones).
Variable embedding size
We also explore the behaviour of AN2VEC for different embedding sizes. We train models with FA=FX∈{8,16,24,32} and overlaps 0, 8, 16, 24, and 32 (whenever there are enough dimensions to do so), with variable test size. Figure 7 shows the AUC scores for link prediction, and Fig. 8 shows the F1-micro scores for node classification, both on CiteSeer (the behaviour is similar on Cora, though less salient). For link prediction, beyond confirming previously observed trends, we see that models with fewer total embedding dimensions perform slightly better than models with more. More interestingly, all models seem to reach a plateau at overlap 8, and then exhibit slightly fluctuating behaviour as the overlap continues to increase (in models that have enough dimensions to do so). This holds for all test sizes, and suggests (i) that at most 8 dimensions are needed to capture the commonalities between network and features in CiteSeer, and (ii) that having more dimensions to capture either shared or non-shared information is not necessarily useful. In other words, 8 overlapping dimensions seem to capture most of what can be captured by AN2VEC on the CiteSeer dataset, and any further increase in dimensions (overlapping or not) would capture redundant information.
Node classification, on the other hand, does not exhibit any consistent behaviour beyond the reduction in performance as the test size increases. Models with fewer total dimensions seem to perform slightly better at zero overlap (though this behaviour is reversed on Cora), but neither the ordering of models by total dimensions nor the effect of increasing overlap is consistent across all conditions. This suggests, similarly to Fig. 6a, that overlap is less relevant to this particular node classification scheme than it is to link prediction.
Memory usage and time complexity
Finally, we evaluate the resources used by our implementation of the method in terms of training time and memory usage. We use AN2VEC with 100-dimension intermediate layers in the encoder and the (deep) decoder, 16 embedding dimensions for each reconstruction task (FA+FAX=FX+FAX=16), and overlap FAX∈{0,8,16}. We train this model on synthetic networks generated as in the "Synthetic featured networks" section (setting α=0.8, and without adding any further noise to the features), with M∈{50,100,200,500,1000,2000,5000} communities of size n=10 nodes.
Only CPUs were used for the computations, running on a 4 × Intel Xeon CPU E7-8890 v4 server with 1.5 TB of memory. Using 8 parallel threads for training, we record the peak memory usage, training time, and full job time for each network size, averaged over the three overlap levels. Results are shown in Fig. 9. Note that in a production setting, multiplying the number of threads by n divides compute times by nearly n, since the process is aggressively parallelised. A further reduction in memory footprint can also be achieved by using sparse encodings for all matrices.
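For reference, such measurements can be taken for a single run on Linux along the following lines (an illustrative sketch; `train_fn` is a hypothetical stand-in for a full training job, not part of our code):

```python
import resource
import time

def measure_run(train_fn):
    """Wall-clock training time and peak resident memory of one training job
    (a sketch; `train_fn` is a hypothetical stand-in for the full procedure)."""
    start = time.perf_counter()
    train_fn()
    elapsed = time.perf_counter() - start
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kilobytes on Linux
    return elapsed, peak_kb
```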