
1 Introduction and Related Literature

Humans inhabit a planet composed of several systems and ecosystems, and interaction is a natural phenomenon and characteristic of any given system or ecosystem. Thus, relationships between the constituent entities in a given system/ecosystem are both a strategy for survival and essential to the sustenance of that system/ecosystem. Owing to recent advances in AI, these real-world complex systems and ecosystems can be effectively modelled as social network structures for analysis. Social network graphs [22] are intricate structures which present analytical challenges to Machine Learning (ML) and Deep Learning (DL) models because of their dynamic nature, complex links, and occasionally massive size. In this regard, we propose a new hybrid, DL-based model (RLVECN), based on an iterative learning approach [1], for solving (node) classification as well as clustering problems in Social Network Analysis (SNA) via an edge-sampling strategy.

In SNA, the classification of nodes induces the formation of cluster(s); clusters, in turn, give rise to homophily in social networks. Learning in RLVECN is induced via semi-supervised training. The architecture of RLVECN comprises two (2) distinct Representation Learning (RL) layers, viz: a Knowledge-Graph Vector-Embeddings (VE) layer and a Convolutional Neural Network (ConvNet) layer [13], both of which are trained by means of unsupervised learning. These layers are essentially feature-extraction and dimensionality-reduction layers in which underlying knowledge and viable facts are automatically extracted from the social network structure [15]. The vector-embedding layer is responsible for projecting the feature representation of the social graph onto a q-dimensional real-number space, \( \mathbb {R}^q \). This is done by associating a real-number vector with every unique actor/node in the social network such that the cosine distance of any given tie/edge (a pair of actors) captures a significant degree of correlation between the two associated actors. Furthermore, the ConvNet layer, which feeds on the vector-embedding layer, is responsible for the further extraction of apparent features and/or representations from the social graph. Finally, a classification layer succeeds the RL layers; it is trained by means of supervised learning. The classifier is based on a Neural Network (NN) architecture assembled from multiple, deep layers of stacked perceptrons (NN units) [6]. Every low-dimensional feature (X) extracted by the representation-learning layers is mapped to a corresponding output label (Y); these (X, Y) pairs are used to supervise the training of the classifier such that it can effectively and efficiently learn to identify clusters and classify actors within a given social network structure. Hence, the novel contributions of our research are as stated below:

(1) Proposition of a hybrid, DL-based model, RLVECN, aimed at solving node-classification problems in SNA.

(2) Detailed benchmarking reports with respect to classic objective functions used for classification tasks.

(3) Comparative analysis of RLVECN against state-of-the-art methodologies on standard real-world social networks.

RLVECN is capable of learning the non-linear distributed features enmeshed in a social network [9]. We have evaluated RLVECN against an array of state-of-the-art models and RL methodologies which serve as our baselines, viz:

(i) DeepWalk: Online Learning of Social Representations [19].

(ii) GCN: Semi-Supervised Classification with Graph Convolutional Networks [11].

(iii) LINE: Large-scale Information Network Embedding [25].

(iv) Node2Vec: Scalable Feature Learning for Networks [8].

(v) SDNE: Structural Deep Network Embedding [26].

2 Proposed Methodology and Framework

2.1 Definition of Problem

Definition 1

Social Network, SN: As expressed via Eq. 1, SN is a tuple comprising a set of actors/vertices, V; a set of ties/edges, E; a metadata function, \( f_V \), which extends the definition of the vertex set by mapping it to a given set of attributes, \( V^{\prime } \); and a metadata function, \( f_E \), which extends the definition of the edge set by mapping it to a given set of attributes, \( E^{\prime } \). Thus, a graph function, \( G(V, E) \subset SN \).

$$\begin{aligned} \begin{aligned} SN&= (V, E, f_V, f_E) \equiv (G, f_V, f_E) \\ V&: |\{V\}| = M \qquad \qquad \;\;\text {set of actors/vertices with size, M} \\ E&: E \subset \{U \times V\} \subset \{V \times V\}\;\text {set of ties/edges between V} \\ f_V&: V \rightarrow V^{\prime }\qquad \qquad \qquad \qquad \text {vertices' metadata function} \\ f_E&: E \rightarrow E^{\prime } \qquad \qquad \qquad \qquad \text {edges' metadata function} \\ \end{aligned} \end{aligned}$$
(1)
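For concreteness, the tuple of Eq. 1 can be sketched as a minimal data structure. The Python below is an illustrative sketch only; the class and field names (e.g., SocialNetwork) are ours and not part of the formal definition.

```python
from dataclasses import dataclass
from typing import Any, Dict, Set, Tuple

Vertex = int
Edge = Tuple[Vertex, Vertex]

@dataclass
class SocialNetwork:
    """SN = (V, E, f_V, f_E) as in Eq. 1."""
    V: Set[Vertex]            # set of actors/vertices with size M
    E: Set[Edge]              # set of ties/edges, E subset of V x V
    f_V: Dict[Vertex, Any]    # vertices' metadata function, V -> V'
    f_E: Dict[Edge, Any]      # edges' metadata function, E -> E'

# A tiny SN with three actors and two ties:
sn = SocialNetwork(
    V={0, 1, 2},
    E={(0, 1), (1, 2)},
    f_V={0: "paper_A", 1: "paper_B", 2: "paper_C"},
    f_E={(0, 1): "cites", (1, 2): "cites"},
)
assert all(u in sn.V and v in sn.V for (u, v) in sn.E)   # E subset of V x V
```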

Definition 2

Knowledge Graph, KG: \( (\mathbb {E}, \mathbb {R}) \) is a pair comprising a set of entities, \( \mathbb {E} \), and a set of relations, \( \mathbb {R} \), between the entities. Thus, a KG [24, 28] is defined via a set of triples, \( t: (u, p, v) \), where \( u, v \in \mathbb {E} \) and \( p \in \mathbb {R} \). Also, a KG [27] can be modelled as a social network, SN, such that: \( \mathbb {E} \rightarrow V \), \( \mathbb {R} \rightarrow E \), and \( (\mathbb {E}, \mathbb {R}) \vdash f_V, f_E \).

Definition 3

Knowledge-Graph (Vector) Embeddings, X: The vector-space embeddings, X, generated by the embedding layer are based on a mapping function, f, expressed via Eq. 2. f projects the representation of the graph’s actors to a q-dimensional real space, \( \mathbb {R}^q \), such that existing ties between any given pair of actors, \( (u_i, v_j) \), are preserved under the homomorphism from V to X.

$$\begin{aligned} \begin{aligned} f&: V \rightarrow X \in \mathbb {R}^q \\ f&: (u, p, v) \rightarrow X \in \mathbb {R}^q \quad \text {Knowledge-Graph Embeddings} \end{aligned} \end{aligned}$$
(2)
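A minimal sketch of the mapping f in Eq. 2, assuming the embeddings are stored as a plain lookup table of trainable vectors (the matrix size and variable names below are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
M, q = 1000, 128                           # M unique actors; q-dimensional space R^q
X = rng.normal(scale=0.1, size=(M, q))     # one trainable row vector per actor

def f(v: int) -> np.ndarray:
    """f : V -> X in R^q (Eq. 2): return the embedding vector of vertex v."""
    return X[v]

# Ties (u_i, v_j) are preserved as proximity in R^q: correlated actors
# should yield a high dot product (cf. the softmax of Eq. 4).
u_i, v_j = 3, 7
proximity = f(u_i) @ f(v_j)
```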

Definition 4

Node Classification: Consider an SN comprising partially labelled actors (or vertices), \( V_{lbl} \subset V : V_{lbl} \rightarrow Y_{lbl} \), and unlabelled vertices defined such that: \( V_{ulb} = V - V_{lbl} \). A node-classification task aims at training a predictive function, \( f : V \rightarrow Y \), that learns to predict the labels, Y, for all actors or vertices, \( V \subset SN \), via knowledge harnessed from the mapping: \( V_{lbl} \rightarrow Y_{lbl} \).

Fig. 1. Node classification task in social graphs

2.2 Proposed Methodology

Our proposition, RLVECN, comprises two (2) distinct Feature Learning (FL) layers and one (1) classification layer (Fig. 2).

Representation Learning - Knowledge-Graph Embeddings Layer: Let a social network, SN, be defined by a set of actors/vertices, \( V : U \subset V \, \forall \, \{u_m, v_m\} \in V \), where \( M : m \in M \) denotes the number of unique actors in SN. Additionally, let the ties/edges in SN be defined such that: \( E \subset \{U \times V\} \), where \( u_i \in V \) and \( v_j \in V \) represent a source_vertex and a target_vertex in E, respectively.

The objective function of the vector-embedding layer aims at maximizing the average logarithmic probability of the source_vertex, \( u_i \), being predicted as a neighboring actor of the target_vertex, \( v_j \), over all training pairs, \( \forall \, (u_i, v_j) \in E \). Formally, the function is expressed as in Eq. 3:

$$\begin{aligned} \begin{aligned} \mu = \dfrac{1}{M}\sum _{m=1}^{M}\Big (\sum _{(u_i, v_j) \in E} \log Pr(u_i|v_j)\Big ) \end{aligned} \end{aligned}$$
(3)

Consequently, in order to compute \( Pr(u_i|v_j) \), we have to quantify the proximity of each target_vertex, \( v_j \), with respect to its source_vertex, \( u_i \). The vector-embedding model measures this adjacency/proximity as the cosine similarity between \( v_j \) and its corresponding \( u_i \); over the embedding vectors, this similarity is calculated as the dot product between the target_vertex and the source_vertex. Mathematically, \( Pr(u_i|v_j) \) is computed via a softmax function as defined in Eq. 4:

$$\begin{aligned} \begin{aligned} Pr(u_i|v_j) = \dfrac{\exp (u_i \cdot v_j)}{\sum _{m=1}^{M}\exp (u_m \cdot v_j)} \end{aligned} \end{aligned}$$
(4)

Hence, the objective function of our vector-embedding model with respect to the SN is as expressed by Eq. 5:

$$\begin{aligned} \begin{aligned} \sum _{(u_i, v_j) \in E} \log Pr(u_i|v_j) = \sum _{(u_i, v_j) \in E} \log \dfrac{\exp (u_i \cdot v_j)}{\sum _{m=1}^{M}\exp (u_m \cdot v_j)} \end{aligned} \end{aligned}$$
(5)
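A minimal NumPy sketch of Eqs. 4 and 5 (illustrative only; it shares one embedding matrix between source and target vertices, and a practical implementation would use negative sampling or a framework's optimized softmax):

```python
import numpy as np

def log_pr(U: np.ndarray, i: int, j: int) -> float:
    """log Pr(u_i | v_j): softmax over all M actors (Eq. 4), computed stably."""
    scores = U @ U[j]                    # dot product of every u_m with v_j
    scores -= scores.max()               # numerical stability for exp()
    return scores[i] - np.log(np.exp(scores).sum())

def objective(U: np.ndarray, edges) -> float:
    """Sum of log-probabilities over all training ties (Eq. 5); to be maximized."""
    return sum(log_pr(U, i, j) for (i, j) in edges)

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(5, 8))   # 5 actors, 8-dimensional embeddings
print(objective(U, [(0, 1), (2, 3)]))
```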

Representation Learning - ConvNet Layer: This layer comprises three (3) RL or FL operations, namely: convolution, non-linearity, and pooling operations. RLVECN utilizes a one-dimensional (1D) convolution layer [14] which is sandwiched between the vector-embedding and classification layers. Equation 6 expresses the 1D-convolution operation:

$$\begin{aligned} \begin{aligned} \text {Feature Map}\,(F)&= \text {1D-Input Matrix}\,(X) * \text {Kernel}\,(K) \\ f_{i}&= (X * K)_{i} = (K * X)_{i} = \sum _{j=0}^{J-1}x_{j} \cdot k_{i-j} = \sum _{j=0}^{J-1}k_{j} \cdot x_{i-j} \end{aligned} \end{aligned}$$
(6)

where \( f_{i} \) represents a cell/matrix position in the Feature Map; \( k_{j} \) denotes a cell position in the Kernel; and \( x_{i-j} \) denotes a cell/matrix position in the 1D-Input (data) matrix.
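An illustrative sketch of Eq. 6 in NumPy (the 'valid' windowing choice is ours; np.convolve performs the same flip-and-slide summation):

```python
import numpy as np

def conv1d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """f_i = sum_j k_j * x_{i-j}: 1D convolution of input x with kernel k (Eq. 6)."""
    J = len(k)
    out = np.zeros(len(x) - J + 1)       # 'valid' positions only
    for i in range(len(out)):
        # flip the kernel and slide it across the input window
        out[i] = sum(k[j] * x[i + J - 1 - j] for j in range(J))
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([0.5, 0.25])
assert np.allclose(conv1d(x, k), np.convolve(x, k, mode="valid"))
```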

The non-linearity operation is a rectified linear unit (ReLU) function which introduces non-linearity after the convolution operation, since real-world problems usually exist in non-linear form(s). As a result, the rectified feature/activation map is computed via: \( r_{i} \in R = g(f_{i} \in F) = max(0, f_{i}) \).

The pooling operation is responsible for reducing the input width of each rectified activation map while retaining its vital properties. In this regard, the Max Pooling function is defined such that the resultant pooled (or downsampled) feature map is generated via: \( p_{i} \in P = h(r_{i} \in R) = maxPool(R) \).
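The two post-convolution operations can be sketched as follows (the pooling width of 2 is our illustrative choice; the paper does not fix it here):

```python
import numpy as np

def relu(F: np.ndarray) -> np.ndarray:
    """r_i = g(f_i) = max(0, f_i): rectified feature/activation map."""
    return np.maximum(0.0, F)

def max_pool(R: np.ndarray, width: int = 2) -> np.ndarray:
    """p_i = maxPool(R): downsample by taking the max over adjacent windows."""
    n = len(R) // width * width                 # drop any ragged tail
    return R[:n].reshape(-1, width).max(axis=1)

F = np.array([-1.0, 3.0, 0.5, -2.0, 4.0, 1.0])  # feature map from Eq. 6
P = max_pool(relu(F))                            # -> array([3. , 0.5, 4. ])
```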

Classification - Multi-Layer Perceptron (MLP) Classifier Layer: This is the last layer of our proposed RLVECN architecture, and it succeeds the RL layers. The pooled feature maps generated by the RL layers contain high-level features extracted from the constituent actors of the social network structure. Hence, the classification layer utilizes these extracted “high-level features” to identify the clusters, based on their respective classes, contained in the social graph. In this regard, an MLP [5] function is defined as a mathematical function, \( f_{c} \), mapping some set of input values, P, to their respective output labels, Y. In other words, \( Y = f_{c}(P, \varTheta ) \), where \( \varTheta \) denotes a set of parameters. The MLP [4] function learns the values of \( \varTheta \) that result in the best approximation of the decision, Y, for the input set, P. The MLP classifier’s output is a probability distribution which indicates the likelihood of a feature representation belonging to a particular output class. Our MLP [10] classifier is modelled such that sequential layers of NN units are stacked against each other to form a Deep Neural Network (DNN) structure [3, 16].
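A minimal Keras sketch of such a stacked-perceptron classifier (the layer widths and the seven-class output are our illustrative assumptions; e.g., Cora has seven label classes):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(input_dim: int, n_classes: int) -> keras.Model:
    """Y = f_c(P, Theta): map pooled feature maps P to a class distribution Y."""
    return keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(128, activation="relu"),            # stacked NN units
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),   # probability distribution
    ])

clf = build_mlp(input_dim=64, n_classes=7)
clf.summary()
```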

Node Classification Algorithm: Defined via Algorithm 1.


2.3 Proposed Architecture/Framework

Figure 2 illustrates the architecture of our proposition, RLVECN.

Fig. 2. Proposed system architecture

3 Data Sets and Materials

3.1 Data Sets

As presented in Table 1, six (6) real-world benchmark social-graph data sets were employed for experimentation and evaluation, viz: Cora [20, 23], CiteSeer [20, 23], Facebook Page-Page webgraph [21], PubMed-Diabetes [17], Internet-Industry partnerships [2, 12], and Terrorists-Relationship [29].

Table 1. Benchmark data sets

3.2 Data Preprocessing

Each benchmark data set should comprise actors and ties already encoded as discrete data (natural-number format). However, the Cora, CiteSeer, Facebook-Page2Page, PubMed-Diabetes, and Terrorists-Relation data sets contain nodes and/or edges encoded in mixed formats (categorical and numerical). Thus, it is necessary to transcode these non-numeric (categorical) entities to their respective discrete (numeric) representations, without semantic loss, via an injective function that maps each distinct entry in the categorical-entity domain to a distinct numeric value in the discrete-data codomain, \( f_m : categorical \rightarrow discrete \). Thereafter, the numeric representations of all benchmark data sets are normalized, \( f_n : discrete \rightarrow continuous \), prior to training against RLVECN and the benchmark models. Also, only edgelist ties, \( E \subset G \), whose constituent actors are present in the nodelist, \( V \subset G \), were used for training/testing our model.
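The transcoding and normalization functions, \( f_m \) and \( f_n \), can be sketched as below (a pandas-based illustration of the idea, not the authors' released preprocessing code; the column names are placeholders):

```python
import pandas as pd

def transcode(col: pd.Series) -> pd.Series:
    """f_m : categorical -> discrete. Injective: each distinct entry gets a unique code."""
    codes, _uniques = pd.factorize(col)
    return pd.Series(codes, index=col.index)

def normalize(col: pd.Series) -> pd.Series:
    """f_n : discrete -> continuous. Min-max scaling to [0, 1] (assumes max > min)."""
    return (col - col.min()) / (col.max() - col.min())

nodes = pd.DataFrame({"actor": ["u1", "u2", "u3"], "affil": ["AI", "DB", "AI"]})
nodes["actor_id"] = transcode(nodes["actor"])   # u1 -> 0, u2 -> 1, u3 -> 2
nodes["affil_norm"] = normalize(transcode(nodes["affil"]).astype(float))
```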

Table 2. Configuration of experimentation hyperparameters
Fig. 3. RLVECN’s learning curves during node-classification training against the Cora, CiteSeer, Facebook-Page2Page, PubMed-Diabetes, Internet-Industry-Partnership, and Terrorists-Relation data sets (loss function vs. training epochs)

Table 3. Node classification over the Cora data set. Results are based on the held-out validation sample (data set vs. models).
Table 4. Node classification over the CiteSeer data set. Results are based on the held-out test sample (data set vs. models).
Table 5. Node-classification experiment over the Facebook Page-Page webgraph and PubMed-Diabetes data sets. Results are based on the reserved test sample (data set vs. models). N.B.: Mtc \( = \) Metric (fitness function); Pts \( = \) Points.
Table 6. Node-classification experiment over the Internet-Industry-Partnership and Terrorists-Relationship data sets. Results are based on the reserved validation sample (data set vs. models). N.B.: Mtc \( = \) Metric (fitness function); Pts \( = \) Points.

4 Experiment, Results, and Discussions

RLVECN’s experimentation setup was tuned in accordance with the hyperparameters shown in Table 2. Our evaluations herein were recorded with reference to a range of objective functions: Categorical Cross Entropy was employed as the cost/loss function, while fitness/utility was measured via the following metrics: Precision (PC), Recall (RC), F-measure or F1-score (F1), Accuracy (AC), and Area Under the Receiver Operating Characteristic Curve (RO). Moreover, the objective functions were computed against each benchmark data set with regard to the constituent classes (or categories) present in each data set. The Support (SP) represents the number of ground-truth samples per class/category for each data set.
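For reference, these per-class metrics can be reproduced with scikit-learn; the label arrays below are placeholders, not our experimental outputs:

```python
from sklearn.metrics import accuracy_score, classification_report

y_true = [0, 2, 1, 2, 0, 1]     # ground-truth classes (placeholder values)
y_pred = [0, 2, 1, 1, 0, 1]     # model predictions (placeholder values)

# PC, RC, F1, and SP are reported per class/category:
print(classification_report(y_true, y_pred, digits=4))
print("AC:", accuracy_score(y_true, y_pred))

# RO requires per-class probability scores, e.g. with predicted probabilities y_prob:
# from sklearn.metrics import roc_auc_score
# roc_auc_score(y_true, y_prob, multi_class="ovr")
```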

In a bid to avoid sample bias across the board, we used exactly the same SP for all models, inclusive of RLVECN. However, since RLVECN is based on an edge-sampling technique, the SP recorded against RLVECN represents the number of edges/ties used for computation, as explained in Algorithm 1. With regard to the standard node-classification tasks herein, the performance of our RLVECN model, benchmarked against five (5) popular baselines (DeepWalk, GCN, LINE, Node2Vec, SDNE) and evaluated on the validation/test samples of the benchmark data sets, is documented in Tables 3, 4, 5, and 6, respectively. Furthermore, Fig. 3 graphically shows RLVECN’s learning-progress curves during the node-classification tasks when training over the benchmark data sets: the dotted black lines represent learning progress over the training set, and the dotted blue lines represent learning progress over the test set.

Tables 3, 4, 5, and 6 tabulate our results as multi-class classification tasks over the benchmark data sets. For each class per data set, we lay emphasis on F1 (the weighted average of the PC and RC metrics) and RO; accordingly, the model which performed best on a given classification task (based on the F1 and RO metrics) is highlighted in bold font. Additionally, we employed a point-based ranking standard to ascertain the fittest model for each node-classification task: the model with the highest aggregate points is the fittest model for the specified task, and so on in descending order of aggregate points. As can be seen from our tabular results, our proposed methodology (RLVECN) ranks at the top with the highest fitness points.

L2 regularization (\( L2 = 0.04 \)) [7] and early stopping [18] were used herein as add-on regularization techniques to overcome the overfitting incurred during RLVECN’s training. Early stopping was applied to RLVECN’s training over the benchmark data sets as follows: 50 epochs (Cora, CiteSeer, Facebook-Page2Page, PubMed-Diabetes, Terrorists-Relation) and 180 epochs (Internet-Industry-Partnership). We used a mini-batch size of 256 for training and testing/validation so as to ensure that sufficient patterns are extracted by RLVECN during training before its network weights are updated.
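These settings can be sketched in Keras as follows; the paper fixes L2 = 0.04, the batch size of 256, and categorical cross entropy, while the optimizer choice, the layer shown, and the `model` and data variables are our placeholders:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# L2 weight regularization (lambda = 0.04) attached to a representative layer:
dense = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(0.04))

# Early stopping on the validation loss, restoring the best weights:
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           restore_best_weights=True)

# model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(X_train, Y_train,
#           validation_data=(X_test, Y_test),
#           batch_size=256,      # mini-batch size used herein
#           epochs=180,          # upper bound; early stopping halts sooner
#           callbacks=[early_stop])
```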

5 Limitations, Conclusion, Future Work, and Acknowledgements

The benchmark models evaluated herein were implemented using their default parameters. We were unable to evaluate GCN [11] against the Facebook Page-Page webgraph, PubMed-Diabetes, and Internet-Industry-Partnership data sets because none of these data sets possesses the vectorized feature set required by the GCN model for input-data processing. Overall, RLVECN’s remarkable performance with respect to our benchmarking results can be attributed to the following:

(1) The RL kernel of RLVECN comprises two (2) distinct FL layers, viz: Knowledge-Graph Embeddings and ConvNet [13].

(2) The high-quality data-preprocessing techniques employed herein with respect to the benchmark data sets: we ensured that all constituent actors of a given social graph were transcoded to their respective discrete data representations, without any loss in semantics, and normalized prior to training and/or testing.

In future work, we intend to expand our experimentation scope to include more social-network data sets and benchmark models. This research was made possible by International Business Machines (IBM), SHARCNET, and Compute Canada (www.computecanada.ca).