1 Introduction

Graphs are powerful data structures that allow us to easily express connectivity (in the form of edges) among entities (known as nodes). Real-world graphs are ubiquitous, e.g. they can be in the form of social networks [1, 2], biological networks [3, 4], knowledge graphs [5], publication citation networks [2, 6], etc. There have been many machine learning applications on graphs, e.g. determining the community a person belongs to on an online social network, classifying the functional role of a molecule in a biological interaction graph, or predicting purchase patterns in buyers–products–sellers graphs on online e-commerce platforms. Graph representation learning in machine learning—also known as graph embedding—has been proposed to encode the structural information of the graph by constructing an embedding vector for each node in the graph as the node’s representation. In other words, it maps each node in the graph to a low-dimensional Euclidean space. The goal of graph representation learning is to optimize this mapping so that geometric relationships in the learned space reflect the structure of the original graph. Such graph embedding [7, 8] has achieved great successes in machine learning tasks such as node classification and link prediction [1, 6, 9, 10].

Graph representation learning involves incorporating the structural information of the graph. In practice, however, graphs contain not only structural information but also properties (also called attributes) of nodes. For example, a publication citation network consists of papers as nodes and citation relationships as edges, where the nodes have properties related to the content of the papers. Similarly, a social network (such as Twitter) can be represented by a graph in which every user is represented by a node with properties such as the user profile and user behaviours, while the follower/followed relationships among nodes form the structural information depicted with edges. We call such graphs property graphs: in addition to the structural information, there are certain properties associated with each node. In real-world networks, there are two types of property graphs. One is homophilous graphs, in which most connected nodes are from the same class or have similar properties, such as citation networks where papers mostly cite research from the same research area, and social networks where users tend to connect to users with similar interests. By contrast, a graph with heterophily describes the preference of nodes to connect to nodes that are not similar to them. Heterophilous graphs often occur in financial transaction networks, where fraudsters often perform transactions with non-fraudulent users; dating networks, in which most connections are between people of opposite genders, are another example of heterophilous graphs [11]. Note that, in our study, we focus on homophilous property graphs, i.e. graphs in which edges tend to connect similar nodes.

Among the existing work on graph representation learning, Graph Neural Networks (\({ \mathop { \texttt {GNN} } }\)) are undoubtedly the most effective at modelling property graphs with homophily. Conceptually, the fundamental idea of \({ \mathop { \texttt {GNN} } }\) models is to employ deep artificial neural networks to learn an embedding of each individual node in the graph by aggregating not only the feature information of the node itself but also that of its local neighbourhood. For example, the Graph Convolutional Network (\({ \mathop { \texttt {GCN} } }\)) [6]—an example of a \({ \mathop { \texttt {GNN} } }\)—performs convolution on graphs by transforming node representations into the spectral domain using the graph Fourier transform. \({ \mathop { \texttt {GraphSAGE} } }\) [1] extends \({ \mathop { \texttt {GCN} } }\) from a spectral method to a spatial one and can efficiently generate node embeddings for previously unseen data by sampling and aggregating features from a node’s local neighbourhood.

Such \({ \mathop { \texttt {GNN} } }\) models have shown remarkable performance in many downstream tasks, especially on graphs with strong homophily [12]. Despite the power of these \({ \mathop { \texttt {GNN} } }\) methods, they have certain limitations:

  • They are likely to be ineffective in aggregating neighbouring information for nodes that have no or few relations with other nodes in the graph.

  • The neighbourhood of a node is defined as the set of all neighbours which are one or more hops away. We conjecture that there might be nodes which could be very similar to the node in question but are not in its neighbourhood. Existing methods are not able to aggregate such highly similar (non-neighbourhood) nodes.

  • Spatial-based methods like \({ \mathop { \texttt {GraphSAGE} } }\) [1] sample neighbours uniformly when aggregating their information. This does not consider the fact that different neighbours may influence the node rather differently.

To address the above limitations, in this paper, we propose a novel framework named Enhanced Property Graph Embedding (\({ \mathop { \texttt {EPGE} } }\)) for graph representation learning in property graphs. \({ \mathop { \texttt {EPGE} } }\) has the following salient features:

  • To address the first and second challenges described above, apart from the existing (original) graph, we create a latent graph based on the node property information (illustrated in Fig. 1). This helps considerably for nodes with no or few neighbours in the original graph. Moreover, the latent graph has the ability to capture important features from distant but informative nodes.

  • To address the third challenge, a bias strategy is applied to sample neighbours (not only immediate neighbours but also latent neighbours), differentiating the influences of the neighbours. We propose a sampling strategy that helps choose the most informative neighbours.

Fig. 1: Illustration of the latent graph. The original graph is on the left, and the latent graph we create is on the right. Dashed lines represent latent connections, in which the new neighbours of the red node are constructed based on node property similarity.

Once the neighbours are selected, we aggregate the immediate and latent neighbourhoods to compute the final node embeddings. We claim that the final embeddings obtained with \({ \mathop { \texttt {EPGE} } }\) are much more powerful than those produced by existing state-of-the-art methods. To back up this claim, we define a quantitative evaluation metric to measure the usefulness of the sampled neighbourhood information. By conducting experiments on five public graph datasets, we demonstrate the superior performance of \({ \mathop { \texttt {EPGE} } }\) over existing state-of-the-art baselines. Moreover, we separately validate the efficacy of the salient features of \({ \mathop { \texttt {EPGE} } }\) by performing detailed ablation studies. We also discuss how various parameters (e.g. the latent connection construction threshold, the size of the sampled neighbour set and the number of edges) impact the model performance.

Our contributions are summarized as follows:

  • We propose a method for property graph representation learning, which not only exploits the existing graph, but also builds a latent graph. In addition, it has an effective neighbourhood sampling technique.

  • We define a quantitative metric to measure the usefulness of the sampled neighbourhood, which helps evaluate the superiority of our proposed method.

  • We report an extensive experimental analysis to evaluate the merits of our proposed algorithms.

The rest of the paper is organized as follows. Section 2 reviews existing work on graph representation learning. Section 3 provides the preliminaries followed by Sect. 4 that presents the proposed method. Section 5 defines a quantitative index to evaluate and explain the effectiveness of the proposed \({ \mathop { \texttt {EPGE} } }\) model. We discuss experimental results in Sect. 6 and conclude with our contributions and future work in Sect. 7.

2 Related work

A growing body of literature has been devoted to graph representation learning. In the following, we discuss a few prominent research directions.

2.1 Random walk-based methods

Random walk-based methods are one of the early approaches to graph representation learning; they approximate various characteristics of the graph such as node centrality [13] and similarity [14].

Two prominent examples of graph embedding techniques based on random walks are DeepWalk [15] and Node2vec [16]. DeepWalk [15] proposed to use a skip-gram model to learn node embeddings by constructing the relationships among nodes based on paths obtained from random walks. It was observed empirically that the frequency with which nodes appear in short random walks follows a power-law distribution, just as word frequency does in natural language. Therefore, language modelling can be applied to graph representation learning. DeepWalk presented a generalization of language modelling that explores the graph through a stream of short random walks. These walks can be thought of as short sentences, and nodes in walks are analogous to words in sentences. Node2vec [16] utilized biased random walks that reach a trade-off between breadth-first and depth-first graph search. Specifically, Node2vec introduced two hyper-parameters p and q to control the breadth-first and depth-first behaviour of the random walk, respectively. Grid search is used to seek the optimal hyper-parameters for network representation learning. Therefore, Node2vec can not only capture the local topology of a node but also explore deeper structural information, thereby improving the effectiveness of network representation learning. Walklets [17] modified the random walk strategy in DeepWalk. By skipping over steps in each random walk, Walklets generated a corpus of node pairs that are reachable via paths of a fixed length. This corpus can then be used to learn a series of latent representations, each of which captures successively higher-order relationships from the adjacency matrix. With this random walk strategy, Walklets can capture relationships among nodes at a larger spatial scale. HARP [18] utilized a graph coarsening procedure to collapse related nodes in the graph together into super nodes. The coarsened graph was used to learn a set of initial representations, and the learned embedding of each super node was used as an initial value for the random walk embeddings of the super node’s constituent nodes. The process can be repeated in a hierarchical manner at varying levels of coarseness. In [7], it was shown that random walk-based methods are inefficient for processing large graphs, because the node embeddings are independent and there is no sharing of parameters. In addition, only the structural information of the graph is learned, and the properties of nodes are not taken into consideration in such models.
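To make the role of the two hyper-parameters concrete, below is a minimal sketch (not the original implementation) of how a node2vec-style walk chooses its next step; the adjacency-list representation and function names are illustrative assumptions.

```python
import random

def node2vec_next(adj, prev, curr, p=1.0, q=1.0):
    """Pick the next node of a biased random walk from `curr`, given the
    previously visited node `prev`. `adj` maps each node to a set of neighbours.
    q > 1 keeps the walk close to `prev` (BFS-like); q < 1 pushes it outwards (DFS-like)."""
    neighbours = list(adj[curr])
    if prev is None:                      # first step of the walk: unbiased
        return random.choice(neighbours)
    weights = []
    for nxt in neighbours:
        if nxt == prev:                   # distance 0 from prev: return parameter
            weights.append(1.0 / p)
        elif nxt in adj[prev]:            # distance 1 from prev
            weights.append(1.0)
        else:                             # distance 2 from prev: in-out parameter
            weights.append(1.0 / q)
    return random.choices(neighbours, weights=weights, k=1)[0]

def node2vec_walk(adj, start, length, p=1.0, q=1.0):
    """Generate one walk of `length` nodes starting from `start`."""
    walk, prev = [start], None
    while len(walk) < length:
        nxt = node2vec_next(adj, prev, walk[-1], p, q)
        prev, walk = walk[-1], walk + [nxt]
    return walk
```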

2.2 Graph neural networks

Growing research in deep learning over the past few years has led to a deluge of deep neural network-based methods applied to graphs [1, 6, 19], leading to a formulation known as Graph Neural Networks (\({ \mathop { \texttt {GNN} } }\)). Unlike random walk-based methods, \({ \mathop { \texttt {GNN} } }\) models encode nodes into vectors by aggregating feature information from a node’s local neighbourhood via neural networks.

Several researchers have attempted to define convolution operators on graphs to learn graph representations. Graph convolutions can often be categorized into spectral approaches and spatial approaches. Spectral approaches perform convolution by transforming node representations into the spectral domain using the graph Fourier transform or its extensions. Bruna et al. [20] first introduced convolution for graph data in the spectral domain using the graph Laplacian matrix and used a learnable diagonal matrix as the filter. However, this operation in [20] is computationally inefficient and the filter is non-spatially localized. To solve this efficiency problem, [21] proposed ChebNet and improved the spectral-based approach by using a polynomial filter. Kipf and Welling [6] further simplified the filtering by using only the first-order neighbours. Spatial approaches perform convolutions directly on the graph based on the graph topology. The major challenge of spatial approaches is defining the convolution operation for differently sized neighbourhoods. Duvenaud et al. [22] proposed a spatial method that used different weight matrices for nodes with different degrees, but it may not scale to large graphs with high node degrees. Atwood and Towsley [23] used transition matrices to define the neighbourhood of nodes. Niepert et al. [24] defined a “receptive field” for each node by selecting a fixed number of nodes from its k-step neighbourhoods and adopted a standard 1-D CNN with proper normalization to learn graph embeddings.

Many techniques have been introduced to further improve \({ \mathop { \texttt {GNN} } }\) models, especially spatial approaches, and some of these techniques are general. Inspired by the attention mechanism, [10] incorporated attention into \({ \mathop { \texttt {GNN} } }\) so that node neighbourhoods are aggregated with different weights. Some methods added “skip connections” to make \({ \mathop { \texttt {GNN} } }\) models deeper. In [2], researchers explored an architecture that learns to selectively exploit information from neighbourhoods of differing locality. They proposed the Jumping Knowledge Networks that selectively combine different aggregations at the last layer, i.e. the node representations of each layer directly “jump” to the last layer. This network learns representations of different orders for different graph substructures; hence, the trained model can improve the representations, in particular for graphs with sub-graphs of diverse local structures. In computer vision, a convolutional layer is usually followed by a pooling layer to obtain more general features. Similar to these pooling layers, some research focuses on designing hierarchical pooling layers on graphs. H-GCN [25] repeatedly aggregated nodes with similar structures into hyper-nodes and then refined the coarsened graph back to the original graph with the aim of restoring the representation of each node. Instead of merely aggregating one- or two-hop neighbourhood information, the proposed coarsening procedure enlarged the receptive field of each node; hence, more global information can be captured. \({ \mathop { \texttt {GNN} } }\) models aggregate messages for each node from its neighbourhood. Intuitively, if multiple \({ \mathop { \texttt {GNN} } }\) layers are stacked, the size of the neighbourhood grows exponentially with the depth. Therefore, sampling techniques have been adopted to alleviate this “neighbour explosion” issue. GraphSAGE [1] is able to efficiently generate node embeddings for previously unseen data by sampling and aggregating features from a node’s local neighbourhood. GraphSAGE [1] does not utilize the full set of neighbours but a fixed-size set of neighbours obtained by uniform sampling. Our method is based on GraphSAGE and proposes an effective neighbour sampling technique.

In addition to the above techniques, researchers have recently attempted to build \({ \mathop { \texttt {GNN} } }\) models by designing and optimizing the network architecture. DAGNN [26] proposed a deep adaptive graph neural network to learn node representations from larger receptive fields. N-GCN [27] trained multiple instances of \({ \mathop { \texttt {GCN} } }\) over node pairs discovered at different distances in random walks and learned a combination of the instance outputs. In fact, our proposed method can also obtain larger receptive fields by introducing latent neighbours that are similar but far away from each other.

More recent attention has been focused on \({ \mathop { \texttt {GNN} } }\) models for graphs with heterophily, where most connected nodes are from different classes. Zhu et al. [28] proposed a framework called CPGNN that incorporates an interpretable compatibility matrix for modelling the heterophily or homophily level in the graph, enabling it to go beyond the assumption of strong homophily. Wang et al. [12] designed a propagation mechanism that can automatically change the propagation and aggregation process according to the homophily or heterophily between node pairs. By introducing two measurements of homophily degree, this model can adaptively learn the propagation process. Du et al. [29] proposed a \({ \mathop { \texttt {GNN} } }\) model based on a bi-kernel feature transformation and a selection gate: two kernels capture homophily and heterophily information, respectively, and the gate is introduced to select which kernel is used for a given node pair. Platonov et al. [11] argued that the most significant drawback of the standard datasets used for evaluating heterophily-specific models is the presence of a large number of duplicate nodes, leading to train–test data leakage and making results obtained on these datasets unreliable. Platonov et al. [11] also showed that standard \({ \mathop { \texttt {GNN} } }\) models achieve strong results on heterophilous graphs, almost always outperforming specialized models. Compared with these studies, which automatically learn the homophily or heterophily between node pairs to improve the performance of standard \({ \mathop { \texttt {GNN} } }\) models under heterophily, our proposed framework EPGE focuses on homophilous graphs and aims to capture similarities not only from the local neighbours but also from distant, latent neighbours, so as to enhance the node representation to the maximum extent.

3 Preliminaries

Let us discuss some preliminaries in this section.

3.1 Graph definition

Let \({\mathcal {G}} = ({\mathcal {V}}, {\mathcal {E}})\) represent an undirected graph with N nodes and their connecting edges. We have a set of nodes \(v_i \in {\mathcal {V}}\) and edges \((v_i, v_j) \in {\mathcal {E}}\). The features of nodes are denoted as \({\textbf{Z}} = \{{\textbf{z}}_1,..., {\textbf{z}}_N\} \in R^{N \times F}\), where F denotes the size of the feature vector. For any node \(v \in {\mathcal {V}}\), \({\mathcal {N}}(v)\) is the set of nodes that are in the neighbourhood of node v based on \({\mathcal {E}}\).

3.2 Graph representation learning

By definition, graph representation learning (graph embedding) is an approach that learns a mapping from high-dimensional sparse graphs into low-dimensional, dense and continuous vector spaces, while maximally preserving the graph structure properties. Graph embedding has been used in the literature in two ways [4]:

  • Graph Embedding encodes each node of a graph with its own lower-dimensional vector representation, with the following definition:

Definition 1

(Graph Embedding) Given a graph \({\mathcal {G}} = ({\mathcal {V}}, {\mathcal {E}})\) with nodes \({\mathcal {V}}\) and their connecting edges \({\mathcal {E}}\), graph embedding is a mapping \({f}: v_i \in {\mathcal {V}} \rightarrow {\textbf{y}}_i \in {\mathbb {R}}^d, \) such that \(d \ll \vert {\mathcal {V}} \vert \) and the function f preserves some proximity measure, such as node similarity, defined on the original graph \({\mathcal {G}}\).

  • Whole-graph Embedding represents the whole graph as a single latent vector and is defined as:

Definition 2

(Whole-graph Embedding) Given a set of graphs \({\mathcal {G}} = \{{\mathcal {G}}_1,..., {\mathcal {G}}_m\}\), whole-graph embedding is a mapping \({f}: {\mathcal {G}}_i \rightarrow {\textbf{y}}_i \in {\mathbb {R}}^d, \) such that the function f preserves some proximity measure defined on the set of graphs \({\mathcal {G}}\).

In this study, we use the former definition. Thus, an embedding maps each node to a low-dimensional feature vector, and the resulting nonlinear and highly informative graph embeddings can be conveniently used to address the node classification task.

3.3 GraphSAGE

Fig. 2: Aggregators in \({ \mathop { \texttt {GraphSAGE} } }\)

As we discussed earlier, \({ \mathop { \texttt {GraphSAGE} } }\) [1] is an improvement over the original \({ \mathop { \texttt {GCN} } }\) model. \({ \mathop { \texttt {GCN} } }\) is trained independently for a fixed graph and requires the full graph Laplacian, so it is inherently transductive. \({ \mathop { \texttt {GraphSAGE} } }\) replaces the full graph Laplacian with learnable aggregation functions, which are key to performing message passing, and it is a general inductive learning framework that can efficiently generate embeddings for unseen nodes by using their feature information. Such inductive capability is important for processing large-scale graphs and leads to strong generalization performance on unseen nodes.

The core idea of \({ \mathop { \texttt {GraphSAGE} } }\) is to generate embeddings of the target node by learning an aggregator function that samples and aggregates features from the node’s local neighbourhood, as shown in Fig. 2. The sampling strategy used in \({ \mathop { \texttt {GraphSAGE} } }\) is to uniformly sample a fixed-size set of neighbours, sampling with replacement in cases where the sample size is larger than the node’s degree. Five candidate aggregator functions are used, namely:

  1. MEAN aggregator,
  2. GCN aggregator,
  3. LSTM aggregator,
  4. MeanPooling aggregator and
  5. MaxPooling aggregator.

Let us discuss these five aggregators at the kth depth in the following.

MEAN aggregator is defined as:

$$\begin{aligned} \begin{aligned} {\textbf{h}}^k_{{\mathcal {N}}(v)} \leftarrow \text {MEAN}\Big (\{{\textbf{h}}^{k-1}_u, \forall u \in {\mathcal {N}}(v)\}\Big ); \\ {\textbf{h}}^k_v \leftarrow \sigma \Big ({\textbf{W}} \cdot \text {CONCAT} ({\textbf{h}}^{k-1}_v, {\textbf{h}}^k_{{\mathcal {N}}(v)})\Big ). \end{aligned} \end{aligned}$$
(1)

\({\mathcal {N}}(v)\) is the immediate neighbourhood set of node v, and \({\textbf{h}}\) denotes a node’s representation at a given step. \(\text {MEAN}\) is the mean operator, which takes the element-wise mean of the neighbour vectors \(\{{\textbf{h}}^{k-1}_u\}\). The immediate neighbourhood is thus aggregated into a single vector \({\textbf{h}}^k_{{\mathcal {N}}(v)}\), and \({ \mathop { \texttt {GraphSAGE} } }\) then concatenates this aggregated neighbourhood vector with the node’s current representation \({\textbf{h}}^{k-1}_v\). \({\textbf{W}}\) denotes the weight matrix, and \(\sigma \) is a nonlinear activation function.
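As a concrete illustration of Eq. 1, the following is a minimal NumPy sketch of one MEAN-aggregation step for a single node; the array shapes and names are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def mean_aggregate(h_v, h_neighbours, W, activation=np.tanh):
    """One MEAN-aggregator step of Eq. 1.
    h_v:          (d,)   representation of node v at depth k-1
    h_neighbours: (n, d) representations of the sampled neighbours at depth k-1
    W:            (d_out, 2*d) trainable weight matrix
    """
    h_N = h_neighbours.mean(axis=0)           # element-wise mean over neighbours
    concat = np.concatenate([h_v, h_N])       # CONCAT(h_v^{k-1}, h_{N(v)}^k)
    return activation(W @ concat)             # sigma(W . concat)
```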

On the other hand, GCN aggregator is defined as:

$$\begin{aligned} {\textbf{h}}^k_v \leftarrow \sigma \Big ({\textbf{W}} \cdot \text {MEAN} \big (\{{\textbf{h}}^{k-1}_v\} \cup \{{\textbf{h}}^{k-1}_u, \forall u \in {\mathcal {N}}(v)\}\big )\Big ), \end{aligned}$$
(2)

where \(\cup \) is the union operation.

LSTM aggregator replaces the MEAN operation in Eq. 1 as:

$$\begin{aligned} {\textbf{h}}^k_{{\mathcal {N}}(v)} \leftarrow \text {LSTM}\Big (\{{\textbf{h}}^{k-1}_u, \forall u \in \pi ({\mathcal {N}}(v))\}\Big ), \end{aligned}$$
(3)

where \(\text {LSTM}\) (long short-term memory) is a special Recurrent Neural Network to process inputs in a sequential manner. \(\pi (\cdot )\) is a random permutation operation.

MeanPooling aggregator replaces the MEAN operation in Eq. 1 as:

$$\begin{aligned} {\textbf{h}}^k_{{\mathcal {N}}(v)} \leftarrow \text {MEAN}\Big (\{\sigma ({\textbf{W}}_\mathrm{{pool}}{\textbf{h}}^{k-1}_{u_i} + {\textbf{b}}), \forall u_i \in {\mathcal {N}}(v)\}\Big ). \end{aligned}$$
(4)

Here, \({\textbf{W}}_\mathrm{{pool}}\) is the weight matrix of a fully connected layer in this pooling aggregator.

Finally, MaxPooling aggregator replaces the MEAN operation in Eq. 1 as:

$$\begin{aligned} {\textbf{h}}^k_{{\mathcal {N}}(v)} \leftarrow \text {MAX}\Big (\{\sigma ({\textbf{W}}_\mathrm{{pool}}{\textbf{h}}^{k-1}_{u_i} + {\textbf{b}}), \forall u_i \in {\mathcal {N}}(v)\}\Big ). \end{aligned}$$
(5)

\({ \mathop { \texttt {GraphSAGE} } }\) aims at learning an aggregator instead of learning a representation for each node. This idea improves the flexibility and generalization ability of the model. In addition, thanks to this flexibility, the model can be trained in batches to improve the convergence speed. It is important to note that \({ \mathop { \texttt {GraphSAGE} } }\) consistently outperforms state-of-the-art \({ \mathop { \texttt {GNN} } }\) baselines [1]. Although \({ \mathop { \texttt {GraphSAGE} } }\) and other existing neighbourhood aggregation methods have achieved good performance on graph representation learning, they are not able to aggregate nodes that are similar but far away from each other. Moreover, these methods overlook the fact that different nodes in the neighbourhood may have different influences on the node. We address these issues with our proposed \({ \mathop { \texttt {EPGE} } }\) method and further improve the performance of \({ \mathop { \texttt {GraphSAGE} } }\) in this study.

3.4 Existing and latent graphs

We define Existing Graph as:

Definition 3

(Existing Graph) An existing graph is a graph that can be created based on original connections between entities and is parametrized as: \({\mathcal {G}}^e = ({\mathcal {V}},{\mathcal {E}}_{{\mathcal {G}}^e},{\mathcal {P}})\), where \({\mathcal {V}}\) are the nodes, \({\mathcal {E}}_{{\mathcal {G}}^e}\) correspond to their relationships and \({\mathcal {P}}\) denotes the properties of nodes.

The existing graph defined above is fundamental for describing the relationships among nodes, and node representations can be efficiently improved by aggregating features from neighbours in it. However, it is unable to capture long-distance dependencies between nodes that have similar properties but are far apart in the existing graph. In order to address this issue, we define the Latent Graph as:

Definition 4

(Latent Graph) A latent graph is parametrized as \({\mathcal {G}}^l = ({\mathcal {V}},{\mathcal {E}}_{{\mathcal {G}}^l}, {\mathcal {P}}, \lambda _{{\mathcal {G}}^l})\) with nodes \({\mathcal {V}}\) and links \({\mathcal {E}}_{{\mathcal {G}}^l}\), where a link between two nodes \(u, v \in {\mathcal {V}}\) exists if the similarity between the nodes exceeds a pre-defined threshold \(\lambda _{{\mathcal {G}}^l}\). \({\mathcal {P}}\) represents the node properties, with the same meaning as in \({\mathcal {G}}^e\).

It can be seen that edges in the latent graph depend on the similarity between nodes and are governed by a pre-defined threshold. Of course, one can control the number of edges by changing this threshold.

4 Methodology

Let us discuss our proposed \({ \mathop { \texttt {EPGE} } }\) framework in this section. We will start by discussing the framework, followed by our discussion of latent graph construction. Later, we will discuss various features of our proposed formulation.

4.1 EPGE framework

Fig. 3: Pictorial illustration of the \({ \mathop { \texttt {EPGE} } }\) framework

Our proposed framework is illustrated in Fig. 3, which depicts the various steps of our formulation. First, the existing graph \({\mathcal {G}}^e\) is obtained along with the property vector of each node. Second, two nodes are considered similar if the similarity of their property vectors exceeds a pre-defined threshold \(\lambda _{{\mathcal {G}}^l}\), in which case a latent edge is created between them. These latent edges and the corresponding nodes form the latent graph \({\mathcal {G}}^l\). Third, a biased neighbourhood sampling strategy is applied: neighbours that are more similar to the node have higher priority to be aggregated, until a fixed-size set of neighbours is obtained. Note that, in this step, the neighbourhood contains both the immediate neighbours in the existing graph and the latent neighbours in the latent graph. Finally, a node embedding \({\textbf{x}}_v\) is obtained by aggregating the property vectors of the node and its neighbours.

4.1.1 Latent graph construction

Let us discuss the creation of the latent graph. For the existing property graph \({\mathcal {G}}^e\), we have nodes \(v_i \in {\mathcal {V}}\) and edges \((v_i, v_j) \in {\mathcal {E}}_{{\mathcal {G}}^e}\). The property vector representation of node \(v \in {\mathcal {V}}\) is denoted as \({\textbf{z}}_v\). For \(v, u \in {\mathcal {V}}\), their similarity and the latent edge are defined as:

$$\begin{aligned} \begin{aligned}&S(v, u) = \textrm{Pearson Similarity} ({\textbf{z}}_v, {\textbf{z}}_u); \\&{\mathcal {E}}_{{\mathcal {G}}^l} (u,v) = {\left\{ \begin{array}{ll} 1, \quad \text { if } \quad S(u,v) \ge \lambda _{{\mathcal {G}}^l} \\ 0, \quad \text { otherwise.} \end{array}\right. } \end{aligned} \end{aligned}$$
(6)

Here, S(u, v) denotes the similarity between two nodes—if it exceeds the threshold \(\lambda _{{\mathcal {G}}^l}\), the two nodes are linked by a latent edge, and these latent edges form the latent graph \({\mathcal {G}}^l\). One can use any similarity measure, such as Pearson, Spearman or dot product. Here, we adopt Pearson as the default measure as it works best in practice. We analyse the effect of the parameter \(\lambda _{{\mathcal {G}}^l}\) on the model performance in Sect. 6.7.1.
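As a concrete illustration of Eq. 6, the sketch below builds the latent edge set from pairwise Pearson similarities of the property vectors; it is a naive \(O(N^2)\) version meant only to show the rule, and the function and variable names are illustrative.

```python
import numpy as np

def build_latent_edges(Z, threshold=0.5):
    """Z: (N, F) node property matrix. Returns the set of latent edges (u, v),
    u < v, whose Pearson similarity exceeds `threshold` (Eq. 6)."""
    S = np.corrcoef(Z)                  # (N, N) pairwise Pearson similarities
    N = Z.shape[0]
    edges = set()
    for u in range(N):
        for v in range(u + 1, N):
            if S[u, v] >= threshold:
                edges.add((u, v))
    return edges
```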

4.1.2 Bias sampling neighbourhood

Many real-world graphs have high-degree nodes, i.e. nodes with a large number of neighbours. Considering all neighbours for aggregation is usually inefficient and unnecessary [30]. Given that a node’s neighbours in a graph have no natural ordering (unlike words in sentences or pixels in images), GraphSAGE [1] uniformly samples a fixed-size set of neighbours. It has been demonstrated that aggregating neighbour information is effective in graph representation learning [1]. However, not all neighbours of a node provide positive information. Therefore, our proposed framework \({ \mathop { \texttt {EPGE} } }\) improves \({ \mathop { \texttt {GraphSAGE} } }\) by deriving the set of sampled neighbours based on their similarity. The intuition is that similar neighbours (similar in any type of property) can consolidate and enhance the node embedding. In other words, in the \({ \mathop { \texttt {EPGE} } }\) model, neighbours that are more similar to the node being processed have higher priority to be aggregated. We discuss how to evaluate the benefit of this strategy in Sect. 5 and provide the experimental analysis in Sect. 6.8.

4.1.3 Multi-modality neighbourhood aggregation

The neighbourhood \({\mathcal {N}}(v) = \Big \{{\mathcal {N}}_e(v),{\mathcal {N}}_l (v)\Big \}\) of node v includes its neighbourhoods in both the existing graph and the latent graph. The existing neighbourhood \({\mathcal {N}}_e(v)\) consists of the set of v’s adjacent nodes in the existing graph \({\mathcal {G}}^e\), and the latent neighbourhood \({\mathcal {N}}_l(v)\) consists of the nodes whose similarity to node v is higher than the threshold \(\lambda _{{\mathcal {G}}^l}\). We regard these two types of neighbourhood as a multi-modality neighbourhood and discuss their fusion in this section. During aggregation, we combine the existing neighbourhood and the latent neighbourhood to generate the node embeddings. The motivation is that different types of neighbours make different contributions to the final node representations. The existing neighbourhood reflects the node’s original, explicit relationships. In comparison with this readily available relationship, the latent neighbourhood indicates long-range dependencies of the node, which are invisible and cannot be captured directly. We consider two approaches to aggregating the existing and latent neighbourhoods. One is to treat the two modalities of neighbourhood equally and sample neighbours purely according to their similarity with the node in question. The other is to give the existing neighbourhood priority to be sampled, while the latent neighbourhood is regarded as a supplement until a fixed-size set of neighbours is obtained. In the experiments, we adopt the latter approach, sketched below.
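A minimal sketch of this latter approach (existing neighbours first, the most similar latent neighbours as a supplement) is given below; \(\beta \) is the fixed sample size used in Algorithm 1, and the function and argument names are illustrative rather than the released code.

```python
def sample_neighbourhood(v, existing_nbrs, latent_nbrs, similarity, beta):
    """Return at most `beta` sampled neighbours of v: existing neighbours have
    priority (the most similar first if there are more than beta), and the most
    similar latent neighbours are used as a supplement."""
    by_similarity = lambda u: similarity(v, u)
    existing = sorted(existing_nbrs, key=by_similarity, reverse=True)
    if len(existing) >= beta:
        return existing[:beta]
    # supplement with the most similar latent neighbours not already present
    latent = sorted(set(latent_nbrs) - set(existing_nbrs),
                    key=by_similarity, reverse=True)
    return existing + latent[:beta - len(existing)]
```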

4.2 The algorithm of EPGE

Algorithm 1 describes the overall procedure of our proposed \({ \mathop { \texttt {EPGE} } }\) framework. In steps 3 to 9, the similarity of nodes is calculated, and the latent neighbours are constructed based on this similarity. In step 11, when the size of the existing neighbourhood is less than \(\beta \) (the predetermined size of the neighbour set to be sampled), we select \(\beta - \vert {\mathcal {N}}_e(v) \vert \) neighbours from the latent neighbours as a supplement; the latent neighbours are sorted so that the neighbours with the highest similarity are chosen. Next, for each node \(v \in {\mathcal {V}}\), the representations of its sampled neighbourhood, \(\{{\textbf{h}}^{k-1}_u, \forall u \in {\mathcal {N}}^s{(v)}\}\), are aggregated in step 15. Here, AGGREGATOR is one of the five candidate aggregator functions introduced in Sect. 3.3. Then, we use the CONCAT operation to concatenate the aggregated neighbourhood vector \( {\textbf{h}}^k_{{\mathcal {N}}^s{(v)}}\) and the node’s current representation \({\textbf{h}}^{k-1}_v\) (step 16). This concatenated vector is fed through a fully connected layer with a nonlinear activation function \(\sigma \). In this process, any of the five aggregators described in Sect. 3.3 can be adopted. Finally, we obtain the final representations at depth K, denoted as \({\textbf{x}}_v \leftarrow {\textbf{h}}^K_v, \forall v \in {\mathcal {V}} \) (step 21). Based on the learned node representations, the node classification task can be conducted.

Algorithm 1: The \({ \mathop { \texttt {EPGE} } }\) Algorithm
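For readability, the aggregation loop of Algorithm 1 (with the MEAN aggregator) can be sketched as below; the sampled neighbour lists could, for instance, be produced by the `sample_neighbourhood` helper sketched earlier. All names, shapes and the per-depth weight list are illustrative assumptions, not the released implementation.

```python
import numpy as np

def epge_forward(nodes, Z, sampled_nbrs, weights, activation=np.tanh):
    """Sketch of the K-depth aggregation loop of Algorithm 1 (MEAN variant).
    nodes:        iterable of node ids
    Z:            (N, F) node property matrix, used as h^0
    sampled_nbrs: dict mapping each node to its sampled neighbour list
                  (existing + latent, assumed non-empty for every node)
    weights:      list of K weight matrices, weights[k] of shape (d_k, 2*d_{k-1})
    Returns the final embeddings x_v = h^K_v.
    """
    h = {v: Z[v] for v in nodes}                                    # h^0_v = z_v
    for W in weights:                                               # depths k = 1..K
        h_next = {}
        for v in nodes:
            h_N = np.mean([h[u] for u in sampled_nbrs[v]], axis=0)  # aggregate neighbours
            concat = np.concatenate([h[v], h_N])                    # CONCAT(h_v, h_N)
            h_next[v] = activation(W @ concat)
        h = h_next
    return h
```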

4.3 Model training

In this study, we take node classification in a supervised setting as the specific downstream task. The cross-entropy is applied as the loss function of the model. With the labelled nodes, we train \({ \mathop { \texttt {EPGE} } }\) by minimizing the cross-entropy via back-propagation and gradient descent. Thus, the loss function is calculated as:

$$\begin{aligned} {\mathcal {L}} = -\sum _{v \in {\mathcal {V}}}\Big (y_v \log {p_v} + (1-y_v) \log (1-p_v)\Big ), \quad \textrm{where} \quad p_v = \sigma ({\textbf{w}}^T {\textbf{x}}_v+b). \end{aligned}$$
(7)

This cross-entropy loss compares the model’s predictions with the true labels of the data in the classification task. Here, \({\textbf{x}}_v\) is the final representation of node v obtained from Algorithm 1, \(p_v\) is the predicted probability for node v and \(y_v\) is the ground-truth label.
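For completeness, the per-node term of Eq. 7 can be computed as in the short sketch below (the negative sign makes it a loss to be minimized); the names are illustrative.

```python
import numpy as np

def node_cross_entropy(x_v, w, b, y_v):
    """Per-node term of Eq. 7: x_v is the learned embedding, y_v the 0/1 label."""
    p_v = 1.0 / (1.0 + np.exp(-(w @ x_v + b)))      # p_v = sigma(w^T x_v + b)
    return -(y_v * np.log(p_v) + (1.0 - y_v) * np.log(1.0 - p_v))
```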

5 Label consistency metric

\({ \mathop { \texttt {GNN} } }\) models obtain a node’s representation by collecting information from its neighbourhood. In practice, however, not all neighbours of a node contain relevant information: some neighbours may convey positive information to the node, while others may introduce negative disturbance. For the node classification task, it is reasonable to consider that neighbours with the same class label as the target node contribute positive information. In our proposed \({ \mathop { \texttt {EPGE} } }\) framework, based on \({ \mathop { \texttt {GraphSAGE} } }\), we improve the performance by introducing the latent neighbourhood and adopting the biased sampling strategy. The purpose of these two strategies is to choose the neighbourhood that is helpful to the node representation. In order to interpret why our proposed method selects better neighbourhoods and thus achieves better performance, we introduce a metric named label consistency, which quantitatively measures the usefulness of the sampled neighbourhood information.

Consider the node classification task where each node \( v \in {\mathcal {V}} \) has a label \(y_v\); we say \(v_i \simeq v_j\) if \(y_{v_i} = y_{v_j}\). We then define the label consistency metric as:

$$\begin{aligned} \tau = \sum _{e_{v_i,v_j} \in {\mathcal {N}}^s_{(v)}} \Big (\small {{\mathbb {I}}}(v_i \simeq v_j)/|{\mathcal {N}}^s{(v)} |\Big ), \end{aligned}$$
(8)

where \({{\mathbb {I}}}(v_i \simeq v_j)\) is an indicator function that equals 1 if \(v_i \simeq v_j\) and 0 otherwise, and \({\mathcal {N}}^s{(v)}\) is the sampled neighbourhood. A larger \(\tau \) implies that neighbours with the same label tend to be sampled, in which case the surrounding nodes contribute more positive information for the node representation. Therefore, the larger the \(\tau \), the better the sampled neighbours are for node representation learning. In Sect. 6.8, we compute this metric for \({ \mathop { \texttt {EPGE} } }\) and \({ \mathop { \texttt {GraphSAGE} } }\) on the experimental datasets, which explains, to some extent, the superiority of the proposed \({ \mathop { \texttt {EPGE} } }\) framework over \({ \mathop { \texttt {GraphSAGE} } }\).
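A per-node version of Eq. 8 can be computed as below; how the per-node values are combined into the dataset-level figures reported later (here, by averaging over nodes) is our assumption, as the text does not state it explicitly.

```python
def label_consistency(v, sampled_nbrs, labels):
    """Eq. 8 for one node v: fraction of its sampled neighbours sharing v's label."""
    same = sum(1 for u in sampled_nbrs if labels[u] == labels[v])
    return same / len(sampled_nbrs)

def dataset_label_consistency(sampled_nbrs_of, labels):
    """Assumed dataset-level value: the average of the per-node metric."""
    values = [label_consistency(v, nbrs, labels)
              for v, nbrs in sampled_nbrs_of.items()]
    return sum(values) / len(values)
```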

6 Experiment and analysis

6.1 Datasets

Table 1 Statistics of datasets

We evaluate our proposed method on five public datasets, which are widely used for \({ \mathop { \texttt {GNN} } }\) node classification. The statistics of these datasets are summarized in Table 1. The details of these datasets are as follows:

  • Cora [31] is a citation network dataset consisting of machine learning papers as nodes and citation relationships as edges. The papers yield a vocabulary of 1433 unique words after stemming and removing stop-words. Each paper is represented by binary values indicating whether each word in the vocabulary is present (indicated by 1) or absent (indicated by 0) in the paper.

  • CiteSeer [31] is another citation network dataset, in which documents and citations are treated as nodes and edges. The CiteSeer papers yield a vocabulary of 3703 unique words. As with the Cora dataset, the property vectors of nodes are represented by these words.

  • PubMed [32] is a citation network from the PubMed database, which contains a set of articles (nodes) related to diabetes and the citation relationships (edges) among them. The node properties are TF-IDF representation for the article, and the node labels are the diabetes type addressed in the articles.

  • PPI [1] is a biological graph of Protein–Protein Interactions, where the task is predicting protein functions. The node properties include positional gene sets, motif sets and immunological features.

  • HateUser contains a network of 100k users, of which about 5k were annotated as either hateful or not. If one user has retweeted a post of another user, this retweet connection is represented as an edge in the dataset. The properties of users can be content-related, activity-related, sentiment-related, etc.

6.2 Baselines

We compare our proposed model with ten baselines for graph representation learning. DeepWalk [15] and Node2vec [16] are representatives of random walk-based methods. The recent graph convolutional network models include GCN [6], GAT [19], JK-LSTM [2], H-GCN [25], N-GCN [27], DAGNN [26] and Geom-GCN [33]. GraphSAGE is the state of the art based on neighbour aggregation. As our proposed model \({ \mathop { \texttt {EPGE} } }\) mainly builds on \({ \mathop { \texttt {GraphSAGE} } }\), we conduct a more detailed comparative analysis between \({ \mathop { \texttt {EPGE} } }\) and \({ \mathop { \texttt {GraphSAGE} } }\). The details of the baseline approaches are as follows:

  • DeepWalk [15]—A skip-gram model to learn node embeddings by capturing the relationships between nodes based on random walk paths.

  • Node2vec [16]—A method that considers both graph homophily and structural equivalence by combining breadth-first and depth-first random walks.

  • GCN [6]—A scalable implementation of \({ \mathop { \texttt {GCN} } }\) via a localized first-order approximation of spectral graph convolutions.

  • GAT [19]—A graph neural network architecture that assigns different importance to different neighbours by utilizing a self-attention mechanism and then combines their impacts to generate node embeddings.

  • JK-LSTM [2]—This model proposed the Jumping Knowledge Networks that selectively combine different aggregations at the last layer.

  • H-GCN [25]—This work designed a graph coarsening layer to aggregate nodes with similar structures to hyper-nodes for enlarging the receptive field for each node and improving the performance.

  • N-GCN [27]—This method trained multiple instances of \({ \mathop { \texttt {GCN} } }\) over node pairs discovered at different distances in random walks and learned a combination of the instance outputs.

  • DAGNN [26]—A deep adaptive graph neural network is presented by decoupling the representation transformation and propagation operations, which can ease the over-smoothing issue.

  • Geom-GCN [33]—This scheme first maps a graph to a continuous latent space via node embedding, uses the geometric relationships defined in that space to build structural neighbourhoods, and then applies a bi-level aggregator operating on these structural neighbourhoods to update the node feature representations.

  • GraphSAGE [1]—GraphSAGE samples and aggregates features from a node’s local neighbourhood, instead of training individual embeddings for each node. \({ \mathop { \texttt {GraphSAGE} } }\) reported experimental results for four aggregators, namely GraphSAGE-GCN, GraphSAGE-MEAN, GraphSAGE-LSTM and GraphSAGE-MaxPooling.

6.3 Experiment setup

We implement our method using TensorFlow and follow a standard setup used in the evaluation of other models. To evaluate the performance, we split each dataset into a training set, validation set and testing set with the approximate proportions of \(60\%\), \(20\%\) and \(20\%\), respectively. We use the validation set for hyper-parameter tuning and early stopping, and the test set only to report the performance. Throughout all the experiments, we use the Adam optimizer with a learning rate of 0.001 and a dropout rate of 0.2. The threshold \(\lambda _{{\mathcal {G}}^l}\) of latent connection construction is set to 0.5 by default. We report F1-micro and F1-macro on the test set of each dataset to evaluate the performance of node classification in the property graphs.

6.4 Code

The code of this work can be downloaded from: https://anonymous.4open.science/r/EPGE-open-code-4C05.

6.5 Results of node classification

In this section, we will compare \({ \mathop { \texttt {EPGE} } }\) with several competing algorithms and perform a more detailed comparative analysis between \({ \mathop { \texttt {EPGE} } }\) and \({ \mathop { \texttt {GraphSAGE} } }\).

6.5.1 Comparison with baselines

We first report the node classification accuracy results compared with the baselines in Table 2. To ensure a fair comparison, we use the results reported in the respective papers. Some results are therefore missing because the corresponding methods were not evaluated on those datasets. Note that it is common in graph embedding research to reuse results on standard datasets published in the original papers, given that the experimental setup is consistent across papers.

From Table 2, we can see that \({ \mathop { \texttt {EPGE} } }\) achieves higher F1-micro scores than all the other methods on all datasets, especially on PubMed and PPI, for which the performance improvements are significant. The methods that use node property information (i.e. \({ \mathop { \texttt {EPGE} } }\), \({ \mathop { \texttt {GraphSAGE} } }\) and GCN) achieve better performance than the methods that use the skip-gram model to capture structural relationships (i.e. DeepWalk and Node2vec). In addition, compared with GraphSAGE and GCN-related models (e.g. H-GCN, N-GCN), \({ \mathop { \texttt {EPGE} } }\) further improves the accuracy of node classification, which highlights the effectiveness of our proposed \({ \mathop { \texttt {EPGE} } }\) framework.

Table 2 Comparison of EPGE with baselines
Table 3 Comparison of EPGE with GraphSAGE

6.5.2 Comparison with GraphSAGE

Since our model is conceptually aligned with \({ \mathop { \texttt {GraphSAGE} } }\), it is necessary to compare with \({ \mathop { \texttt {GraphSAGE} } }\) directly. To do this, we implement \({ \mathop { \texttt {EPGE} } }\) and \({ \mathop { \texttt {GraphSAGE} } }\) with five different aggregators, namely the MEAN, GCN, LSTM, MeanPooling and MaxPooling aggregators. Table 3 summarizes the results of \({ \mathop { \texttt {GraphSAGE} } }\) and \({ \mathop { \texttt {EPGE} } }\) for node classification on the five public datasets (better results are highlighted in bold).

It is noteworthy that \({ \mathop { \texttt {EPGE} } }\) generally achieves better performance than standard \({ \mathop { \texttt {GraphSAGE} } }\), especially on PPI and HateUser. Specifically, \({ \mathop { \texttt {EPGE} } }\) achieves higher F1-micro and F1-macro scores than the corresponding GraphSAGE methods except for the Cora dataset. Overall, it is very encouraging to note that our proposed \({ \mathop { \texttt {EPGE} } }\) model can greatly improve the performance of \({ \mathop { \texttt {GraphSAGE} } }\) on the task of node classification in the property graphs.

6.6 Ablation study

Table 4 The results of ablation study

To verify the effectiveness of the proposed latent graph construction and biased sampling strategies, we conduct an ablation study in this section and report both F1-micro and F1-macro as the evaluation metrics. Here, \({ \mathop { {\texttt {EPGE}_{\backslash \texttt {L}}} }}\) refers to \({ \mathop { \texttt {EPGE} } }\) without latent connections and \({ \mathop { {\texttt {EPGE}_{\backslash \texttt {B}}} }}\) refers to \({ \mathop { \texttt {EPGE} } }\) without biased sampling. The results are shown in Table 4, where the top table presents the F1-micro values and the bottom table reports the F1-macro values. The performance degradation (\(\downarrow \)) or improvement (\(\uparrow \)) relative to the full \({ \mathop { \texttt {EPGE} } }\) is given in parentheses.

6.6.1 The strategy of latent graph construction

We remove the latent graph from \({ \mathop { \texttt {EPGE} } }\) but preserve the biased sampling strategy. From the results in Table 4, except on the PPI dataset, \({ \mathop { {\texttt {EPGE}_{\backslash \texttt {L}}} }}\) (\({ \mathop { \texttt {EPGE} } }\) without the latent graph) generally shows performance degradation compared with \({ \mathop { \texttt {EPGE} } }\), with a maximum degradation of \(29.00\%\) in F1-micro and \(25.10\%\) in F1-macro on HateUser. It can be seen that \({ \mathop { \texttt {EPGE} } }\) with certain aggregators yields slightly inferior performance to \({ \mathop { {\texttt {EPGE}_{\backslash \texttt {L}}} }}\) on the Cora, CiteSeer and HateUser datasets (i.e. F1-micro of \({ \mathop { \texttt {EPGE-MEAN} } }\) and \({ \mathop { \texttt {EPGE-MeanPooling} } }\) on the Cora dataset, F1-micro of \({ \mathop { \texttt {EPGE-MEAN} } }\), \({ \mathop { \texttt {EPGE-GCN} } }\) and \({ \mathop { \texttt {EPGE-MaxPooling} } }\) on the CiteSeer dataset, F1-macro of \({ \mathop { \texttt {EPGE-MaxPooling} } }\) on the CiteSeer dataset, F1-micro and F1-macro of \({ \mathop { \texttt {EPGE-MeanPooling} } }\) and \({ \mathop { \texttt {EPGE-MaxPooling} } }\) on the PubMed dataset). However, the performance degradation never exceeds \(1\%\). These ablation studies reveal the efficacy of latent connections in learning node embeddings, especially for datasets with relatively few edges.

6.6.2 The biased sampling strategy

To study the impact of the biased sampling strategy, we compare \({ \mathop { \texttt {EPGE} } }\) with \({ \mathop { {\texttt {EPGE}_{\backslash \texttt {B}}} }}\) (\({ \mathop { \texttt {EPGE} } }\) without biased sampling). The results show that \({ \mathop { \texttt {EPGE} } }\) with biased sampling outperforms \({ \mathop { {\texttt {EPGE}_{\backslash \texttt {B}}} }}\) on all five datasets. In particular, for \({ \mathop { \texttt {EPGE-MaxPooling} } }\) on the PPI dataset, performance improvements of up to \(26.00\%\) and \(38.70\%\) are achieved in F1-micro and F1-macro, respectively. For the HateUser dataset, there are \(25.68\%\) and \(27.07\%\) performance improvements in F1-macro for \({ \mathop { \texttt {EPGE-MEAN} } }\) and \({ \mathop { \texttt {EPGE-GCN} } }\), respectively. These experimental results verify the contribution of adopting a biased sampling strategy in learning the node embeddings.

6.7 Sensitivity analysis

In this set of experiments, we evaluate the effects of some important parameters in \({ \mathop { \texttt {EPGE} } }\) on its performance, including the parameter of latent connection construction, the sampling size of neighbours and the number of edges in the property graph.

6.7.1 The parameter of latent connection construction

Table 5 The analysis of the parameter of latent connection construction

In our proposed model, the latent neighbourhood is determined based on the Pearson similarity. As we discussed, when the similarity between two nodes exceeds the threshold \(\lambda _{{\mathcal {G}}^l}\), they are linked by a latent edge, which finally creates a latent graph. In this section, we probe the influence of the threshold \(\lambda _{{\mathcal {G}}^l}\) on the model’s performance. The results are presented in Table 5, where we present the results with two different threshold values, i.e. \(\lambda _{{\mathcal {G}}^l}\) is set to 0.5 and 0.8.

Compared with \({ \mathop { \texttt {GraphSAGE} } }\), \({ \mathop { \texttt {EPGE} } }\) with either threshold has better performance on all datasets, with just one exception (\({ \mathop { \texttt {EPGE-MaxPooling} } }\) on the Cora dataset). A comparison between the performance of \(\lambda _{{\mathcal {G}}^l}=0.5\) and \(\lambda _{{\mathcal {G}}^l}=0.8\) reveals an interesting pattern. It can be seen that \({ \mathop { \texttt {EPGE} } }\) with the MEAN or LSTM aggregators tends to perform better with \(\lambda _{{\mathcal {G}}^l}=0.5\), whereas \({ \mathop { \texttt {EPGE} } }\) with the GCN or MeanPooling aggregators performs better with \(\lambda _{{\mathcal {G}}^l}=0.8\). We therefore recommend tuning the threshold \(\lambda _{{\mathcal {G}}^l}\) on a validation set instead of fixing a single value, as different values suit different aggregators and datasets.

6.7.2 The setting of the sampling size

Fig. 4: The setting of the sampling size on five datasets

In this section, we investigate the influence of the neighbour sampling size on the model performance. Hamilton et al. [1] recommended a neighbourhood depth of \(K=2\) with neighbourhood sample sizes \(S_1=25\) and \(S_2=10\). Here, \(S_1\) and \(S_2\) are the numbers of neighbours sampled during iterations \(k=1\) and \(k=2\) of Algorithm 1, respectively. In this experiment, we set the default value of K to 2 and compare our \({ \mathop { \texttt {EPGE} } }\) model with \({ \mathop { \texttt {GraphSAGE} } }\) under two settings of the neighbourhood sample sizes \(\{S_1,S_2\}\), namely \(\{25,10\}\) and \(\{10,5\}\). The results on the five datasets are presented in Fig. 4. For each dataset, we present the F1-micro scores of the models with the five aggregators. A similar pattern was observed for the F1-macro results, which are not shown due to space constraints. For each aggregator, we present four sets of results, namely \({ \mathop { \texttt {GraphSAGE} } }\) with \(\{S_1,S_2\}=\{25,10\}\), \({ \mathop { \texttt {GraphSAGE} } }\) with \(\{S_1,S_2\}=\{10,5\}\), \({ \mathop { \texttt {EPGE} } }\) with \(\{S_1,S_2\}=\{25,10\}\) and \({ \mathop { \texttt {EPGE} } }\) with \(\{S_1,S_2\}=\{10,5\}\).

Comparing the results with the five aggregators on the five datasets, it can be seen that in almost all cases, \({ \mathop { \texttt {EPGE} } }\) with \(\{S_1,S_2\}=\{25,10\}\) achieves the best performance, which is consistent with the results in Sect. 6.5. In addition, whether \(\{S_1,S_2\}=\{25,10\}\) or \(\{S_1,S_2\}=\{10,5\}\), the results of \({ \mathop { \texttt {EPGE} } }\) are generally better than those of \({ \mathop { \texttt {GraphSAGE} } }\). Turning to the comparison between \({ \mathop { \texttt {EPGE} } }\) with \(\{S_1,S_2\}=\{10,5\}\) and \({ \mathop { \texttt {GraphSAGE} } }\) with \(\{S_1,S_2\}=\{25,10\}\), we can see that the former performs comparably to, or even better than, the latter, especially on the PubMed, PPI and HateUser datasets. It is encouraging that \({ \mathop { \texttt {EPGE} } }\) with a small sampling size is able to maintain the promising results that \({ \mathop { \texttt {GraphSAGE} } }\) achieves with a larger neighbour set.

6.7.3 The influence of the edges

In the \({ \mathop { \texttt {GraphSAGE} } }\) model, with more edge relationships, nodes are more likely to learn informative messages from their neighbourhood, resulting in better performance for node classification. On the other hand, performance deteriorates if insufficient edges are present; here, our proposed \({ \mathop { \texttt {EPGE} } }\) model can help. In this section, we evaluate the performance of \({ \mathop { \texttt {GraphSAGE} } }\) and \({ \mathop { \texttt {EPGE} } }\) when there are very few edges in the property graph. To conduct this experiment, we randomly drop \(20\%\), \(40\%\) and \(60\%\) of the edges, respectively, and then test the models’ performance on the adjusted graphs. The results on the five datasets are shown in Fig. 5. We present the F1-micro scores of the \({ \mathop { \texttt {EPGE} } }\) and \({ \mathop { \texttt {GraphSAGE} } }\) models with the five aggregators (the F1-macro results show a similar pattern to F1-micro). In these results, we use \({ \mathop { \texttt {GraphSAGE} } }\) with the original edges (i.e. droprate = 0) as the baseline and compare \({ \mathop { \texttt {GraphSAGE} } }\) and \({ \mathop { \texttt {EPGE} } }\) at droprate = 0.2, droprate = 0.4 and droprate = 0.6 with this baseline.

The results indicate that \({ \mathop { \texttt {GraphSAGE} } }\) generally suffers a performance decrease at higher droprates, and \({ \mathop { \texttt {EPGE} } }\) shows a similar trend. However, when comparing \({ \mathop { \texttt {GraphSAGE} } }\) and \({ \mathop { \texttt {EPGE} } }\) at the same droprate, \({ \mathop { \texttt {GraphSAGE} } }\) shows worse results, which means that our method outperforms \({ \mathop { \texttt {GraphSAGE} } }\) on graphs with far fewer edges, demonstrating the effectiveness of the latent connection and biased sampling strategies of \({ \mathop { \texttt {EPGE} } }\).

Fig. 5: The influence of the edges on five datasets

6.8 Label consistency analysis

As presented in Sect. 5, the label consistency metric \(\tau \) evaluates the usefulness of the sampled neighbourhood information. Table 6 reports the values of \(\tau \) for the \({ \mathop { \texttt {GraphSAGE} } }\) and \({ \mathop { \texttt {EPGE} } }\) models on the five datasets. It can be seen that \({ \mathop { \texttt {EPGE} } }\) has a much higher \(\tau \) value than \({ \mathop { \texttt {GraphSAGE} } }\), which implies that a node and its sampled neighbours share the same label more often than in \({ \mathop { \texttt {GraphSAGE} } }\). Based on these results, it is clear that our proposed \({ \mathop { \texttt {EPGE} } }\) has a better neighbour selection and aggregation strategy, which is the main reason why our method obtains encouraging performance.

Table 6 The Label Consistency Metric \(\tau \) of \({ \mathop { \texttt {GraphSAGE} } }\) and \({ \mathop { \texttt {EPGE} } }\)

7 Conclusions and future work

Most graphs in the real world are property graphs, because in addition to structural information, rich property information exists for each node in the graph. Early graph representation learning based on random walks focused only on the structure of the graph when learning node embeddings and overlooked the significance of node properties. Although Graph Neural Networks use the properties as the initial features of nodes and then aggregate feature information from the neighbours, they fail to capture implicit/latent relationships among the nodes that are not explicit in the given structure. To address these limitations of existing methods, we propose a novel framework for property graph representation learning—\({ \mathop { \texttt {EPGE} } }\)—which not only exploits the existing graph but also builds a latent graph based on the similarity between nodes. This latent connection has the ability to capture long-distance dependencies between nodes that have similar properties but are far apart in the graph. More importantly, the property graph we construct is simplified into a homogeneous graph, which is simpler and more efficient than complex heterogeneous graphs, hence requiring less memory and fewer computational resources. In addition, \({ \mathop { \texttt {EPGE} } }\) has an effective neighbour sampling technique that can choose informative features from neighbours. On five publicly available graph datasets, the proposed model outperforms state-of-the-art methods, including GraphSAGE, for the task of node classification. We further confirmed the superiority of our proposed formulation through a novel quantitative metric for the usefulness of the sampled neighbourhood in the graph.