1 Introduction

Classification of nodes in graphs is a relational classification problem where the labels of each node depend on its neighbors. Many problems in domains like image, biology, text or social data labeling can be formulated as graph node classification, and this problem has been tackled with different approaches such as collective classification [21], random walks [1], and transductive regularized models [10]. Most approaches consider homogeneous graphs, where all the nodes share the same set of labels, and propagate labels from seed nodes to their neighbors. Many problems in domains like biology or social data analysis involve heterogeneous networks, where the nodes and the relations between nodes are of different types, each node type being associated with a specific set of labels. For example, the LastFM social network, one of the datasets used in our experiments, links users, tracks, artists and albums via seven different types of relations such as friendship, most listened tracks, and authorship. In such a network, nodes of different types influence each other and their labels are interdependent. The dependency is, however, more complex than with homogeneous networks and depends both on the node types and on their specific relations. Classical methods for homogeneous graphs, based for example on label propagation, usually rely on a simple relational hypothesis such as homophily in social networks. They cannot be easily extended to heterogeneous networks, and new methods have to be developed to deal with this relational classification problem.

In this paper, we consider the problem of node classification in heterogeneous graphs. We propose a transductive approach based on graph embeddings, where the node embeddings are learned so as to reflect both the classification objective for the different types of nodes and the relational structure of the graph. While most embedding techniques consider deterministic embeddings, where each node is represented as a point in a representation space, we focus here on density-based embeddings, which capture some form of uncertainty about the learned representations. Uncertainty can have various causes, such as a lack of information (isolated nodes in the graph) or contradictions between neighboring nodes (different labels). Our hypothesis is that, because of these different factors, training will result in learned representations with different confidence levels, and that this uncertainty is important for this classification problem. For that, we use Gaussian embeddings, which have recently been proposed for learning word [23] and knowledge graph [7] embeddings in an unsupervised setting. More precisely, each graph node representation corresponds to a Gaussian distribution whose mean and variance are learned. The variance term is a measure of the uncertainty associated with the node representation. The objective function is composed of two terms, one reflecting the classification task and the other reflecting the relations between the nodes. Both mono-label and multi-label classification can be handled by the model. For the experiments, we focus on classification in social network data. This type of data offers a variety of situations which allows us to illustrate the behavior and the performance of the model for different types of heterogeneous classification problems.

To summarize, our contributions are as follows: (i) We propose a new method for learning to classify nodes adapted to heterogeneous graph data; (ii) We model the uncertainty associated with the node representations; (iii) We provide a comparison with state of the art baselines on a series of social data classification problems representative of different situations.

2 Related Work

2.1 Graph Node Classification

Several models have been proposed to solve the graph node classification task. We discuss below three main families [4]: (i) collective classification, (ii) random walk type methods, and (iii) semi-supervised/transductive graph regularized models.

Collective Classification. Collective classification algorithms are extensions of classical inductive classification to relational data. They take as input a fixed size vector composed of node features and of statistics on the current labels of the node neighbors. Sen et al. [21] provide an introduction and a comparison of some of these models. They distinguish between two families: local and global models. The former make use of local classifiers. In [14, 15] for example, naive Bayes classifiers are used iteratively, dynamically updating the attributes of nodes as inferences are made about their neighbors. Along these lines, [18] recently introduced an iterative model for sparsely labeled networks which forces the label predictions to match the distribution of the observed data through a maximum entropy constraint. Global classifiers optimize a global loss function using graphical models, e.g. Markov Random Fields. Iterative methods assume that features are associated with the nodes in order to learn the classifier, which is not the case in our work.

Random Walk Type Methods. This family gathers methods where labels are iteratively propagated from seed nodes to all the other nodes in a network. Propagation follows a random walk or a similar iterative mechanism. [8, 28] are among the early Machine Learning (ML) models using random walks for classification in homogeneous graphs. [27] propose an extension of these models for heterogeneous graphs. It relies on hand-defined projections of the graph onto homogeneous graphs, which makes the approach difficult to adapt automatically to new datasets. The Graffiti random surfer model [1] is a state of the art random walk classifier for heterogeneous graphs. It is based on two intertwined random walks, both between nodes of the same type, allowing either one-hop (standard) or two-hop (extended) steps in the graph. It models, to a certain extent, the influence among nodes of different types. In our preliminary tests on different datasets, this model was among the best ones.

Semi-Supervised Transductive Learning. The third family has been developed for exploiting the manifold assumption in semi-supervised learning. The loss function is composed of two main terms: one is for classification on the labeled nodes, the other is a propagation term which encourages neighboring nodes to share similar labels. Seminal works in this direction include [2, 19, 24, 26]. All these models have been developed for homogeneous graphs and perform some form of label propagation similar to random walks. The difference with the latter is that the problem is formulated as a loss minimization problem, which is more general than simply formulating a propagation rule. Relations between random walks and loss-based models are discussed at more length in [4, 29]. Extensions have been proposed over the years to handle more general situations. Multi-relational graphs, where nodes are all of the same type but can be linked by different relations, are considered in [9, 12]. This also allows them to extend the transductive models to inductive formulations. Some authors have attempted to extend homogeneous formulations to the heterogeneous setting. All follow more or less the idea of projecting the heterogeneous graph onto a series of homogeneous ones, thus creating a series of homogeneous classification problems. Work in this direction includes [11], which is a direct extension of the homogeneous formulation in [25]. Graph projections have to be defined for each new problem and none of these models is able to directly exploit the correlation between nodes of different types. The work closest to ours is [10], which was among the first to propose an embedding model for transductive heterogeneous graph classification. This has been the starting point of our work, but they only consider deterministic representations, while we use a more general transductive formulation with probabilistic embeddings.

To summarize, among heterogeneous graph classification approaches, very few allow modeling the influences between nodes of different types. In the experimental section, we will compare our model to [1, 10], which have been designed specifically for heterogeneous classification, as well as to an unsupervised graph embedding model [22] and a homogeneous graph model [28].

2.2 Learning Representations for Graphs and Relational Data

In the last years, there has been a growing interest in learning latent representations. This has led to breakthroughs in domains like image recognition, speech or natural language processing [3, 13]. Graph node embeddings have been proposed for unsupervised learning, where the goal is to learn node representations that preserve the graph structure and that can be exploited later for different purposes like visualization, clustering or classification. [17] learns node representations by performing truncated walks on the graph, assuming that nodes along the path should be close together in the representation space. [22] propose an algorithm designed for very large graphs, which can be used for different types of graphs (undirected, directed, weighted or not); we use their method as our unsupervised baseline, which embeds all data points before training a classifier on the labeled ones. Somewhat related to this topic is the learning of embeddings for graphs, where a unique representation of the whole graph is learned [20], and the learning of triplets in knowledge graphs, where both relation and node representations are learned so as to rank positive triplets over negative ones [5–7]. The setting is, however, quite different from the one considered here. Finally, modeling uncertainty via Gaussian embeddings has been proposed recently for unsupervised learning in [7, 23], for knowledge graphs in the former and words in the latter; both propose energy-based models to learn Gaussian embeddings. In this paper, we borrow their formalization and graph regularization cost in a transductive setting.

3 Model

In this section we present our model, namely Heterogeneous Classification with Gaussian Embeddings (HCGE).

We first introduce the notations used throughout this paper. A heterogeneous network is modeled as a directed weighted graph \(\mathcal {G} = (\mathcal {N},\mathcal {E}, \mathcal {W})\) where \(\mathcal {N}\) is the set of nodes, \(\mathcal {E}\) the set of edges and \(\mathcal {W}\) the weights associated with the edges. Each node \(x_i \in \mathcal {N}\) of the graph has a type \(t_i \in \mathcal {T}\), where \(\mathcal {T}=\{1,2,\ldots ,T\}\). We denote \(N_i\) the set of neighbors of \(x_i\).

Regarding the classification task, let \(\mathcal {Y}^t\) denote the set of categories associated with nodes of type t, and \(\#\mathcal {Y}^t\) the cardinality of \(\mathcal {Y}^t\). \(\mathcal {L} \subset \mathcal {N}\) is the set of indices of labeled nodes. For \(i \in \mathcal {L}\), \(y_i\) is the class vector associated with \(x_i\): node \(x_i\) belongs to category c if \(y_i^c = 1\) and \(y_i^c = -1\) otherwise.

In our model, each node \(x_i\) is mapped onto a Gaussian distribution \(z_i \sim \mathcal {N}(\mu _i, \varSigma _i)\) over the representation space \(\mathbb {R}^Z\). The latent space is common to all nodes. In this paper, we compare two different parameterizations of \(\varSigma \): a spherical (\(\varSigma _i = \sigma _i Id\)) and a diagonal (\(\varSigma _i = diag\left( \sigma _i^p \right) _p\)) covariance matrix. We use a weight \(w_r\) for each type of relation. To simplify notations, we write \(w_{ij}\) for the weight \(w_{r_{ij}}\) of the edge (i, j) linking node i to node j with a relation \(r_{ij}\).
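To make the parameterization concrete, here is a minimal sketch (not the authors' code; the class name and the latent dimension are illustrative assumptions) of how a node's Gaussian embedding can be stored for the two covariance choices, keeping only the diagonal entries in both cases:

```python
import numpy as np

Z = 4  # dimension of the latent space (illustrative value)

class GaussianNode:
    """Gaussian embedding of one node: a mean mu and a diagonal covariance."""
    def __init__(self, spherical: bool = True, rng=np.random.default_rng(0)):
        self.mu = rng.normal(scale=0.1, size=Z)
        if spherical:
            self.sigma = np.full(Z, 1.0)  # entries tied: a single sigma_i (Sigma_i = sigma_i * Id)
        else:
            self.sigma = np.ones(Z)       # one sigma_i^p per dimension (Sigma_i = diag(sigma_i^p))

    def covariance(self) -> np.ndarray:
        return np.diag(self.sigma)        # materialize Sigma_i when needed

node = GaussianNode(spherical=False)
print(node.mu.shape, node.covariance().shape)  # (4,), (4, 4)
```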

Loss Function. We learn the node representations and the classifier parameters by minimizing an objective loss function. It takes the general form of a transductive regularized loss [11, 25], with a classification term (\(\varDelta _C\)) and a regularization term (\(\varDelta _G\)), both detailed later:

$$\begin{aligned} L(z,\theta ) = \sum \limits _{i \in \mathcal {L}} \varDelta _C(f_{\theta ^{t_i}}(z_i),y_i) + \lambda \sum \limits _{i\in \mathcal {N}}\sum \limits _{j\in N_i}w_{ij}\varDelta _G(z_i,z_j) \end{aligned}$$
(1)

As for classical transductive graph losses, the minimization in (1) aims at finding a trade-off between the difference between observed and predicted labels in \(\mathcal {Y}^t\), and the amount of information shared between two connected nodes. There are however major differences, since here z is not a label as in classical formulations, but a node embedding. Finally, the function \(f_{\theta ^t}(.)\) is a parametric classifier for a node of type t; there is one such classifier for each node type. Since we are using Gaussian embeddings, the \(z_i\) are random variables and the regularization term is a dissimilarity measure between distributions.

To avoid overfitting, following [23], we regularize the mean and the covariance matrix associated with each node. We add two constraints that prevent means and covariances from becoming too large and keep the covariance matrices positive definite (this also prevents degenerate solutions):

$$\begin{aligned} ||\mu _i|| \le C \text{ and } \forall p,\, m \le \sigma _i^p \le M \end{aligned}$$
(2)

where the parameters C, m and M were set manually, after a few trials on a subset of the DBLP training set, to 10, 0.01 and 10 respectively (and not changed afterwards); any other reasonable values would do.
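As an illustration, the constraints in (2) can be enforced by a simple projection after each gradient step. The sketch below is our own (the helper name `project` is hypothetical) and uses the constants reported above:

```python
import numpy as np

C, m, M = 10.0, 0.01, 10.0

def project(mu: np.ndarray, sigma: np.ndarray):
    """Project (mu, sigma) back onto the feasible set ||mu|| <= C, m <= sigma^p <= M."""
    norm = np.linalg.norm(mu)
    if norm > C:
        mu = mu * (C / norm)          # rescale the mean onto the ball of radius C
    sigma = np.clip(sigma, m, M)      # keep every diagonal entry in [m, M]
    return mu, sigma

mu, sigma = project(np.ones(5) * 7.0, np.array([1e-4, 0.5, 50.0, 1.0, 2.0]))
print(np.linalg.norm(mu), sigma)      # norm <= 10, entries clipped to [0.01, 10]
```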

The following two paragraphs detail the two terms of (1).

Classifier. The mapping onto the latent space is learned so that the labels of each type of node can be predicted from the (Gaussian) embedding. For that, we use a parametric classification function \(f_{\theta ^t}\) depending on the type t of the node. This multivariate function takes as input a node representation and outputs a vector of scores for each label corresponding to the node type. The parameters \(\theta ^t\) of the classifier are learned by minimizing the following loss on labeled data:

$$\begin{aligned} L_{Classification} = \sum \limits _{i\in \mathcal {L}} \varDelta _C(f_{\theta ^{t_i}}(z_i),y_i) \end{aligned}$$
(3)

where \(\varDelta _C(f_{\theta ^{t_i}}(z_i),y_i)\) is the loss associated with predicting labels \(f_{\theta ^{t_i}}(z_i)\) given the observed labels \(y_i\). We recall that in this equation \(f_{\theta ^{t_i}}(z_i)\) and \(y_i\) have values in \(\mathbb {R}^{\#\mathcal {Y}^t}\).

In our experiments, we used different losses for \(\varDelta _C\). We first considered the case where the class decision is based on the expectation of the classifier score, combined with a hinge loss, adapting the loss proposed in [10]. For a given node x of type t with an embedding z, this gives:

$$\begin{aligned} \varDelta _C(f_{\theta ^{t}}(z),y)=\varDelta _{EV}(f_{\theta ^t}(z),y) \overset{\text {def}}{=} \sum \limits _{k=1}^{\#\mathcal {Y}^t} \max \left( 0;1-y^k \mathbb {E}_{z}[ f^{k}_{\theta ^t}(z)]\right) \end{aligned}$$
(4)

where \(y^k\) is 1 if x belongs to category k and \(-1\) otherwise, and \(f^{k}_{\theta ^{t}}(z)\) is the (random) classifier score for category k.
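For a linear classifier, \(\mathbb {E}_{z}[ f^{k}_{\theta ^t}(z)] = \mu \cdot \theta ^{t}_k\) (see below), so (4) reduces to a multi-label hinge loss on the mean. A minimal sketch (our own illustration; array names are assumptions):

```python
import numpy as np

def delta_ev(theta: np.ndarray, mu: np.ndarray, y: np.ndarray) -> float:
    """theta: (num_labels, Z) weights, mu: (Z,) embedding mean, y: (num_labels,) in {-1, +1}."""
    scores = theta @ mu  # E_z[f^k(z)] for every label k
    return float(np.sum(np.maximum(0.0, 1.0 - y * scores)))

theta = np.array([[1.0, 0.0], [0.0, 1.0]])
print(delta_ev(theta, mu=np.array([2.0, -0.5]), y=np.array([1, -1])))  # 0.0 + 0.5 = 0.5
```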

Alternatively, the density-based representation can be leveraged through a probabilistic criterion, even in the case of linear classifiers. We use here for \(\varDelta _C\) the negative \(\log \)-probability that \(y^k f^{k}_{\theta ^{t}}(z)\) takes a positive value. In this case, the variance is influenced by the two loss terms: if the two terms act in opposite directions, one solution is to increase the variance. As we will see, this is confirmed by the experiments.

$$\begin{aligned} \varDelta _C(f_{\theta ^{t}}(z),y)=\varDelta _{Pr}(f_{\theta ^{t}}(z),y) \overset{\text {def}}{=} -\sum \limits _{k=1}^{\#\mathcal {Y}^t} \log \mathbb {P}\left( y^k f^{k}_{\theta ^t}(z) > 0\right) \end{aligned}$$
(5)

In our experiments and for both costs, we used a linear classifier for \(f^{k}_{\theta ^{t}}\), which makes the different costs and gradients easy to compute, since the random variable \(f^{k}_{\theta ^{t}}(z)\), being a linear combination of Gaussian variables, is Gaussian too. A basic derivation shows that:

$$\begin{aligned} \mathbb {P}\left( y^k f^{k}_{\theta ^t}(z) > 0\right) = \frac{1}{2} \left( 1 + \mathrm {erf}\left( \frac{y^k\,(\mu \cdot \theta ^t)}{\sqrt{2\sum _p{(\theta ^t_p \sigma ^p)^2}}}\right) \right) \end{aligned}$$
(6)

where \(\mathrm {erf}\) is the Gauss error function.
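The probabilistic loss (5) can then be computed from (6). Below is a minimal sketch for a diagonal covariance, treating \(\sigma ^p\) as per-dimension standard deviations and using SciPy's erf; the function name and the clipping constant are our own choices:

```python
import numpy as np
from scipy.special import erf

def delta_pr(theta: np.ndarray, mu: np.ndarray, sigma: np.ndarray, y: np.ndarray) -> float:
    """theta: (num_labels, Z), mu/sigma: (Z,) mean and per-dimension std, y: (num_labels,) in {-1, +1}."""
    mean = y * (theta @ mu)                                    # mean of y^k f^k(z)
    std = np.sqrt(np.sum((theta * sigma) ** 2, axis=1))        # std of f^k(z), unchanged by the sign y^k
    prob = 0.5 * (1.0 + erf(mean / (np.sqrt(2.0) * std)))      # P(y^k f^k(z) > 0), cf. (6)
    return float(-np.sum(np.log(np.clip(prob, 1e-12, 1.0))))   # clip to avoid log(0)

theta = np.array([[1.0, 0.0], [0.0, 1.0]])
print(delta_pr(theta, mu=np.array([2.0, -0.5]), sigma=np.array([1.0, 1.0]), y=np.array([1, -1])))
```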

There are some notable differences between the two classification losses during learning. In the case of a linear classifier \(f_{\theta ^{t}}\), \(\mathbb {E}_{z}[ f^{k}_{\theta ^{t}}(z)] = \mu \cdot \theta ^{t}_k\). Thus, minimizing \(\varDelta _{EV}\) only updates the mean of the Gaussian embedding: the covariance matrix of the embedding does not interfere with the classification term, and is only present in the second term of (1).

For the \(\varDelta _{Pr}\) loss, the probability in (6) involves the term \(\mathrm {erf}\left( \frac{y^k\,(\mu \cdot \theta ^t)}{\sqrt{2\sum _p{(\theta _p^t \sigma ^p)^2}}}\right) \), in which the variance is present. When the graph regularization and classification costs pull the representation mean in opposite directions (opposite gradients), the model will respond by increasing the variance in the spherical variance model: this behavior is interesting since it transforms an opposition between regularization and classification costs into increased uncertainty.

Graph Embedding. We make the hypothesis that two nodes connected in the graph should have similar representations, whatever their type is. Intuitively, this will force nodes of the same type which are close in the graph to be close in the representation space. The strength of this attraction between nodes of the same class will be proportional to their closeness in the graph and to the weight of the path(s) linking them. We use the asymmetric loss proposed in [7, 23]:

$$\begin{aligned} L_{Graph} = \sum \limits _{i}\sum \limits _{j \in N_i} w_{ij} D_{KL}(z_j||z_i) \end{aligned}$$
(7)

where \(\varDelta _G(z_i,z_j) = D_{KL}(z_j||z_i)\) is the Kullback-Leibler divergence between the distributions of \(z_j\) and \(z_i\):

$$\begin{aligned} \begin{aligned} D_{KL}(z_j||z_i)&= \int _{x\in \mathbb {R}^Z} \mathcal {N}(x; \mu _j,\varSigma _j) \log \frac{\mathcal {N}(x; \mu _j,\varSigma _j)}{\mathcal {N}(x; \mu _i,\varSigma _i)} dx \\&=\frac{1}{2}\left( \mathrm {tr}(\varSigma _i^{-1}\varSigma _j) + (\mu _i - \mu _j)^T\varSigma _i^{-1}(\mu _i-\mu _j) - Z - \log \frac{\det (\varSigma _j)}{\det (\varSigma _i)}\right) \end{aligned} \end{aligned}$$
(8)

The loss \(L_{Graph}\) is a sum over the neighbors \(N_i\) of i, where \(w_{ij}\) is the weight of the edge between \(x_i\) and \(x_j\). Other similarity measures between distributions could be used as well; the Kullback-Leibler divergence has the advantage of being asymmetric, which fits well with the social network datasets used in the experiments.
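For diagonal covariance matrices, the closed form in (8) reduces to elementwise operations. A small sketch of our own, mirroring the notation above and treating the diagonal entries as variances:

```python
import numpy as np

def kl_diag(mu_j, var_j, mu_i, var_i) -> float:
    """All arguments are (Z,) arrays; var_* are the diagonal variances of Sigma_j and Sigma_i."""
    Z = mu_i.shape[0]
    diff = mu_i - mu_j
    return 0.5 * float(
        np.sum(var_j / var_i)                    # tr(Sigma_i^{-1} Sigma_j)
        + np.sum(diff ** 2 / var_i)              # (mu_i - mu_j)^T Sigma_i^{-1} (mu_i - mu_j)
        - Z                                      # dimension of the latent space
        - np.sum(np.log(var_j) - np.log(var_i))  # log det(Sigma_j)/det(Sigma_i)
    )

print(kl_diag(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3)))  # 0.0 for identical Gaussians
```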

Algorithm. Learning the Gaussian embeddings \(z \sim \mathcal {N}(\mu ,\varSigma )\) and the classifier parameters \(\theta \) consists in minimizing the loss function in (1). We use stochastic gradient descent to learn the latent representations, i.e. the \(\mu _i\) and \(\varSigma _i\), as well as the parameters \(\theta \) of the classifiers.

Our algorithm samples a pair of connected nodes and then makes a gradient update of the node parameters. For each sampled node \(x_i\) that belongs to the labeled training set \(\mathcal {L}\), the algorithm performs an update according to the classification term (3). This update consists in successively modifying the parameters of the classification function \(\theta ^{t_i}\) and of the latent representation \(\mu _i\) and \(\varSigma _i\) so as to minimize the classification loss term. Then, the model updates its parameters with respect to the graph regularization term (7). Note that, while we use stochastic gradient descent, other methods like mini-batch or batch gradient algorithms could be used as well.
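The sketch below illustrates this procedure under simplifying assumptions (a single node type, the \(\varDelta _{EV}\) loss, updates of the means only, fixed covariances); it is our own schematic rendering of the algorithm, not the authors' implementation, and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
Z, n_nodes, n_labels = 8, 6, 3
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 0.5), (3, 4, 1.0), (4, 5, 1.0)]  # (i, j, w_ij)
labels = {0: np.array([1, -1, -1]), 3: np.array([-1, 1, -1])}              # labeled nodes only

mu = rng.normal(scale=0.1, size=(n_nodes, Z))      # means of the Gaussian embeddings
var = np.ones((n_nodes, Z))                        # diagonal covariances (kept fixed here)
theta = rng.normal(scale=0.1, size=(n_labels, Z))  # one linear classifier (single node type assumed)
lr, lam, C = 0.05, 0.1, 10.0

for step in range(2000):
    i, j, w = edges[rng.integers(len(edges))]      # sample a pair of connected nodes
    # Classification update (Delta_EV hinge, eq. 4) on labeled endpoints.
    for n in (i, j):
        if n in labels:
            y = labels[n]
            active = (1.0 - y * (theta @ mu[n])) > 0           # margin-violating labels
            mu[n] += lr * (y[active, None] * theta[active]).sum(axis=0)
            theta[active] += lr * y[active, None] * mu[n]
    # Graph regularization update: gradient of w_ij * D_KL(z_j || z_i) w.r.t. the means.
    g = (mu[i] - mu[j]) / var[i]
    mu[i] -= lr * lam * w * g
    mu[j] += lr * lam * w * g
    # Projection step of (2) on the means.
    norms = np.linalg.norm(mu, axis=1, keepdims=True)
    mu = np.where(norms > C, mu * (C / norms), mu)

print(theta @ mu[0], labels[0])  # classifier scores for a labeled node after training
```

In the full model, the covariance parameters and the per-type classifiers receive analogous updates, and the \(\varDelta _{Pr}\) variant replaces the hinge gradient with the gradient of the negative log-probability in (5).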

4 Experiments

4.1 Datasets

Experiments have been performed on three datasets respectively extracted from DBLP, Flickr and LastFM. For all but the first dataset (DBLP), each node can have multiple labels. The three datasets are described below and summarized in Table 1.

Table 1. Datasets

The DBLP dataset is a bibliographic network composed of authors and papers. Authors are labeled with their research domain (4 different domains) while papers are labeled with the name of the conference they were published in (20 labels). Authors and papers are connected through an authorship relation. The graph is thus composed of two types of nodes and is bipartite with only one relation type. Classification is mono-label for both papers and authors.

The Flickr corpus is a dataset composed of photos and users. The photo labels correspond to different possible tags, while the user labels correspond to the groups they subscribed to. The classification problem is multi-label: images and users may belong to more than one category. Photos are related to users through an authorship relation, while users are related to other users through a following relation. We kept the image tags that appear in at least 500 images and the user categories that appear at least 500 times in the dataset, resulting in 21 possible labels for photos and 42 for authors.

The LastFM dataset is a social network composed of users, tracks, albums and artists. This dataset was extracted using the LastFM API. The task is multi-label, and each node type has its specific set of labels. Users are labeled with the type of music they like (59 labels), tracks with the kind of music they belong to (28 labels), albums with their type (47 labels) and artists with the kind of music they play the most (47 labels). Users are related to users (friendship), tracks (favorite tracks), albums (favorite albums) and artists (favorite artists). Tracks are related to albums (belong to) and artists (singer). Finally, albums are related to artists (sing in). Note that a track, as well as an album, can be related to several artists. The labels correspond, for tracks, to their genre (rock, indie, ...), for users, to the type of music they like (female vocalists, ambient, ...), for albums, to their type (various artists, live, ...) and for artists, to the kind of music they make (folk, singer songwriter, ...). Some labels may be the same string-wise for different node types, but we consider that labels of different node types are distinct, e.g. pop is not the same for an artist and a track.

We compare our approach with four state-of-the-art models (see Sect. 2):

  • LINE [22], which is representative of unsupervised learning of graph embeddings suitable for various tasks such as classification. We performed a logistic regression with the learned representations as inputs.

  • HLP [28], which is representative of transductive graph algorithms developed for semi-supervised learning. As HLP is designed for homogeneous graphs, we perform as many random walks as there are node types, considering each time that all the nodes are of a single given type.

  • Graffiti [1], which is a state of the art model for classification with random walks in heterogeneous graphs.

  • LSHM [10], which is another state of the art model for classification in heterogeneous graphs, based on deterministic vector representations.

Evaluation Measures and Protocol. For the evaluation, we considered two evaluation measures. The Precision at 1 (P@1) measures the percentage of nodes for which the category with the highest score is among the observed labels. The Precision at k (P@k) is the proportion of correct labels in the set of k labels with the highest predicted scores. Here, micro P@k is an average over all node types, with k set to the number of relevant categories. This measures the capacity of a model to correctly pick the k relevant categories of any node. In the case of DBLP (mono-label dataset), we consider the predicted category to be the category with the highest score and use the Precision at 1 (P@1) measure, as there is at most one label per node. We optimize and compare the different models with regard to the micro-average, and also report the macro-average.
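As a concrete illustration, P@k for a single node can be computed as follows (a small sketch; the function name is ours):

```python
import numpy as np

def precision_at_k(scores: np.ndarray, relevant: set) -> float:
    """scores: predicted score per label; relevant: indices of the observed labels."""
    k = len(relevant)
    top_k = np.argsort(scores)[::-1][:k]          # k labels with the highest scores
    return len(set(top_k) & relevant) / k

print(precision_at_k(np.array([0.9, 0.1, 0.7, 0.3]), {0, 3}))  # top-2 = {0, 2} -> 0.5
```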

Regarding the experimental protocol, we partition each dataset into two subsets, namely a training set and a test set. As all the models have hyperparameters, a subset of the training set is used as a validation set to optimize the hyperparameters by grid search. The optimization is done with respect to the micro P@k measure, which corresponds to the mean of P@k over all nodes. The other part of the training set is used to learn the parameters of the different models. We then compare the different models based on the results on the test set, using for each model the configuration that performed best on the validation set.

Experiments are performed with different training set sizes: \(10\,\%\), \(30\,\%\), \(50\,\%\). Within our transductive setting, the training set size refers to the proportion of labeled nodes used during training. The training nodes are selected at random. The proportion of nodes used for learning the parameters and for selecting the hyperparameters depends on the size of the training set: we use a 50–50 split for a training set size of \(10\,\%\) and an 80–20 (train/validation) split for the others. Experiments are performed with 5 random splits. The hyperparameters are selected for each split using the validation set, and we average 5 runs over each split.

4.2 Results

In this section we present the results of four variants of our Gaussian embedding model, and compare to LINE [22], Graffiti [1], HLP [28] and LSHM [10]. The experiments are performed on the three datasets described in Table 1 and the results are described in Tables 2 (DBLP), 3 (FlickR) and 4 (LastFM). The best performing classifier (on the test set) is presented in bold.

Concerning the four variants of our model, HCGE(\(\varDelta _\bullet \), X) refers to the HCGE model with the classification loss \(\varDelta _\bullet \) (\(\varDelta _{EV}\) or \(\varDelta _{Pr}\)) and a spherical (X=S) or diagonal (X=D) covariance matrix.

Table 2. P@1 DBLP
Table 3. P@k FlickR
Table 4. P@k LastFM

For micro P@k, our model generally outperforms the others on all datasets. Supervised models (HLP, Graffiti, LSHM and HCGE), which use the class information, outperform unsupervised representation learning, which matches the results reported in [10]. On all datasets, the performance of HLP is below that of Graffiti, LSHM and HCGE. This clearly shows that modeling the heterogeneity of the graph brings noteworthy improvements. Comparing the heterogeneous models, both LSHM and HCGE outperform Graffiti on all datasets. On average, compared to Graffiti, LSHM is +2.4 points better on DBLP, +2.1 on FlickR and +2.5 on LastFM. We observed the same behavior for HCGE, with +2.8 on DBLP, +4.4 on FlickR and +6.0 on LastFM. We can note that the more complex the dataset, the larger the gap with the baselines. This also shows that the use of representations can clearly improve the performance.

On each dataset, our model outperforms LSHM (and the other competitors) in 8 cases out of 9, with on average +1.0 point for DBLP, +2.3 for FlickR, and +3.8 for LastFM over the second ranked model. According to these results, introducing uncertainty in the representations clearly improves over LSHM. Let us also point out that, in line with our initial intuition, the use of uncertainty has more impact when the amount of training data is lower: the difference between LSHM and HCGE generally decreases when more training data is available (except for DBLP).

Let us now compare the performance of the variants \(\varDelta _{EV}\) and \(\varDelta _{Pr}\). Globally, \(\varDelta _{Pr}\) seems to be disadvantaged by a low number of training examples, while \(\varDelta _{EV}\) seems more stable with respect to the other baselines. However, the more training data, the closer the \(\varDelta _{Pr}\) variant gets to \(\varDelta _{EV}\). For example, on the DBLP dataset, moving from 10 % to 30 % of training data improves \(\varDelta _{Pr}\) results by +13.7 points on average, but \(\varDelta _{EV}\) results by only +11.1. For a training set size of 50 %, the difference between \(\varDelta _{Pr}\) and \(\varDelta _{EV}\) is +1.1 on DBLP and +0.1 on FlickR. For LastFM, the difference is respectively \(-14.6\) for 10 %, \(-6.5\) for 30 % and \(-1.5\) for 50 % of the dataset used for training. On the three datasets, the lower the training set size, the better \(\varDelta _{EV}\) performs compared to \(\varDelta _{Pr}\). We could not fully explain this difference in behavior between \(\varDelta _{EV}\) and \(\varDelta _{Pr}\), but we believe it is due to the fact that, with \(\varDelta _{EV}\), the covariance matrix is only optimized through the graph regularization term.

Let us now compare the use of a spherical and a diagonal covariance matrix. For the \(\varDelta _{EV}\) variant, moving from a spherical covariance matrix to a diagonal one brings no improvement; it even decreases the performance on DBLP. Concerning the \(\varDelta _{Pr}\) variant, for which the covariance matrix plays a role in the classification cost, the conclusions are reversed and using diagonal covariance matrices improves the results. On the FlickR dataset, using a diagonal variance improves the results by +1.4 on average, although the improvement tends to shrink with more training data: +2.2 for a training set size of 10 %, +1.0 for 30 % and +1.1 for 50 %.

4.3 Qualitative Discussion

In this section, we study qualitatively the representations found by HCGE. We consider the most robust variant of our model (\(\varDelta _{EV}\), S) and the most challenging dataset, LastFM (similar observations were made on the other datasets). We examine the respective roles of the regularization and classification costs on labeled training nodes, and the relationship between the learned variance of a node and its local properties (such as its number of neighbors).

We first examined the respective roles of the classification and regularization costs. In (1), the max-margin classification cost implies that the gradient with respect to a node x is 0 if \(y^k \mathbb {E}_{z}[ f^{k}_{\theta ^t}(z) ]\) is above 1. In this case, the only constraints on the node come from the graph regularization cost. We can see how many of the nodes are actually used by the classification cost by counting the cases for which \(y^k \mathbb {E}_{z}[ f^{k}_{\theta ^t}(z) ]\) is below or equal to 1. Figure 1a shows a histogram of \(y^k \mathbb {E}_{z}[ f^{k}_{\theta ^t}(z) ]\) for labeled nodes in the training set (after convergence). For around one third of the nodes, the value of the classifier is above 1.1: they could be removed from the labeled set without harming the solution (although they could have been useful in the early stages of optimization). This is clearly in agreement with the experiments, where we have shown that representation-based models perform better than the others, and suggests that it would be interesting to use these statistics to predict the performance of the model on held-out data.

Fig. 1. Qualitative results for the model HCGE(\(\varDelta _{EV}\), S) on the LastFM dataset with 50 % of the dataset used for training. In Fig. 1b, a Gaussian kernel density estimate highlights high density regions in the plot.

Regarding the relationship between the learned variance and the local properties of each node, we looked at the relationship between the PageRank [16] of a node and its variance. Figure 1b shows that a high PageRank implies a small variance, which means that the representations of central nodes are less uncertain. The reverse implication, however, does not hold.

5 Conclusion

We have explored the use of uncertainty for learning to represent nodes in the challenging task of heterogeneous graph node classification. The proposed model, Heterogeneous Classification with Gaussian Embeddings (HCGE), learns for each node a Gaussian distribution over the representation space, parameterized by its mean and covariance matrix, by optimizing a loss function that combines a classification loss and a graph regularization loss. We have examined four variants of this model, using either spherical or diagonal covariance matrices, and two different classification losses. Our model can easily be extended to inductive learning by defining the Gaussian representation z as a parameterized function of the input features.

Based on the experimental results obtained on datasets representative of different situations, our main findings are that (i) integrating uncertainty in the representations improves classification, (ii) in line with our initial intuition, the use of uncertainty generally has more impact when the amount of training data is lower, and (iii) as expected, highly central nodes tend to have less variance associated with their representation.

Future work will address more in detail the relationship between the variance and node properties, as well as understanding the interplay between regularization and classification loss when both include the variance in their formulation.