A key challenge for dynamic link prediction is choosing an effective feature set for this task. Earlier works choose features either by adapting topological features from static link prediction or by treating the feature values of different snapshots as a time series. GraTFEL instead uses graphlet transition events (GTEs) as features for link prediction. For a given node-pair, the value of a specific GTE feature is the normalized count of that GTE involving the node-pair over the training data. The strength of GTEs as features for dynamic link prediction comes from the fact that, for a given node-pair, the GTEs involving those nodes capture both the local topology and its transition over the temporal snapshots.
We consider GTEs involving graphlets up to size five (30 graphlets in total), of which the graphlets of size four are shown in Fig. 2 (see Footnote 2).
Footnote 2. The graphlet size upper bound of five is inspired by the fact that, in all three real-life dynamic networks that we use in this work, more than \(95\,\%\) of new links form between vertices that are at most 3 hops apart. So, for a given node, a GTE of a five-vertex graphlet in the neighborhood of that node covers a prospective link formation event as a graphlet transition event. Another reason for limiting the graphlet size is the computational burden, which increases exponentially with the graphlet size. There are 30 different graphlets of size up to 5, and the number of possible transition events (GTEs) is \(O(30^2)\); increasing the maximum graphlet size to 6 increases the number of GTEs to \(O(142^2)\).
The feature representation for a node-pair in a dynamic network is constructed by concatenating GTE features from a contiguous set of graph snapshots. Concatenation, being the simplest form of feature aggregation across a set of graph snapshots, is not necessarily the best feature representation for capturing the temporal characteristics of a node-pair. So, GraTFEL uses unsupervised feature extraction (UFE) to obtain an optimal feature representation from the GTE features. UFE provides a better feature representation by discovering dependencies among different data dimensions, which cannot be achieved by simple aggregation. It also reduces the data dimension and overcomes the sparsity issue in the GTE features. Once the optimal feature representation of a node-pair is known, GraTFEL uses it to solve the link prediction task with a supervised classification model.
The discussion of the proposed method GraTFEL can be divided into three steps: (1) graphlet transition event based feature extraction (Sect. 4.1), (2) unsupervised (optimal) feature learning (Sect. 4.2), and (3) supervised learning for obtaining the link prediction model (Sect. 4.3).
4.1 Graphlet Transition Based Feature Extraction
Say, we are given a dynamic network \(\mathbb {G} = \{G_1,G_2, \dots ,G_t\}\), and we are computing the feature vector for a node-pair (u, v), which constitutes a row in our training data matrix. We use each of \(G_i:1\le i \le t-1\) (time window \([1,t-1]\)) for computing the feature vector and \(G_t\) for computing the target value (1 if an edge exists between u and v in \(G_t\), 0 otherwise). We use \(\varvec{e}_{[1,t-1]}^{uv}\) to represent this vector. It has two components: graphlet transition events (GTE) and link history (LH).
The first component, the Graphlet Transition Event (GTE) feature-set \(\varvec{g}_{[1,t-1]}^{uv}\), is constructed by concatenating the GTE feature-sets of (u, v) for each time stamp, i.e., \(\varvec{g}^{uv}_{[1,t-1]} = \varvec{g}_1^{uv}~||~\varvec{g}_2^{uv}~||~\dots ~||~\varvec{g}_{t-1}^{uv}\). Here, the symbol || represents concatenation of two horizontal vectors (e.g., \(0~1~0 ~|| ~0.5~0~1 = 0~1~0~0.5~0~1\)), and \(\varvec{g}_i^{uv}\) represents (u, v)'s GTE feature-set for time stamp i, which captures the impact of edge (u, v) on its neighborhood structure at time stamp i. We construct \(\varvec{g}_i^{uv}\) by enumerating all graphlet-based dynamic events that are triggered when edge (u, v) is added to \(G_i\).
For example, consider the toy dynamic network in Fig. 3. We want to construct the GTE feature vector \(\varvec{g}_{1}^{36}\), which is the GTE feature representation of node-pair (3, 6) at \(G_1\). We illustrate the construction process in Fig. 4, which shows all the graphlet transitions triggered by edge (3, 6) when it is added to \(G_1\). These transition events are listed in the center table of Fig. 4. The column titled Nodes lists the sets of nodes where the graphlet transitions are observed, and the column Current Graphlet shows the current graphlet structure induced by these nodes. The column Transformed Graphlet shows the graphlet structure after (3, 6) is added. The last column, Graphlet Event, is a visual representation of the transition events, where the transition is reflected by the red edges. Once all the transition events are enumerated, we count their frequencies (table on the right side of Fig. 4). Graphlet transition frequencies can differ substantially across edges, so the GTE vector is normalized by the largest graphlet transition frequency associated with the edge. Also note that not all possible graphlet transition events are observed for a given edge; among all possible GTE types, only those that are observed for at least one node-pair in the training dataset are included in the GTE feature-set.
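To make the counting and normalization steps concrete, the following sketch (illustrative only; the GTE type identifiers and the dictionary of counts are hypothetical placeholders) builds one per-snapshot GTE vector \(\varvec{g}_i^{uv}\) from raw transition-event frequencies.

```python
import numpy as np

def gte_vector(event_counts, gte_types):
    """Build the normalized GTE feature vector g_i^{uv} for one snapshot.

    event_counts : dict mapping a GTE type id -> observed frequency for (u, v)
    gte_types    : ordered list of GTE types observed at least once in training
    """
    vec = np.array([event_counts.get(t, 0) for t in gte_types], dtype=float)
    if vec.max() > 0:                      # normalize by the largest frequency
        vec /= vec.max()
    return vec

# toy counts for node-pair (3, 6) at G_1; the type ids are hypothetical labels
counts = {"g5->g8": 2, "g3->g6": 1}
types = ["g3->g6", "g5->g8", "g7->g12"]
print(gte_vector(counts, types))           # -> [0.5 1.  0. ]
```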
The second component of the node-pair feature vector is the Link History (LH) of the node-pair, which is not captured by the GTE feature-set \(\varvec{g}^{uv}_{[1,t-1]}\). The link history \(\mathbf {lh}^{uv}_{[1,t-1]}\) of a node-pair (u, v) is a vector of size \(t-1\), denoting the edge occurrences between the participating nodes over the time window \([1,t-1]\). It is defined as \(\mathbf {lh}^{uv}_{[1,t-1]} = \mathbf {G}_1(u,v)~||~\mathbf {G}_2(u,v)~||~\dots ~||~\mathbf {G}_{t-1}(u,v)\), where \(\mathbf {G}_i(u,v)\) is 1 if edge (u, v) is present in snapshot \(G_i\) and 0 otherwise. An appearance of an edge in recent time indicates a higher chance of the edge reappearing in the near future. So, we consider the weighted link history, \(\mathbf {wlh}^{uv}_{[1,t-1]} = w_1\cdot \mathbf {G}_1(u,v)~||~w_2\cdot \mathbf {G}_2(u,v)~||~\dots ~ ||~w_{t-1}\cdot \mathbf {G}_{t-1}(u,v)\), where \(w_i=i/(t-1)\) (a time-decay function) is the weight of the link history for time stamp i. Finally, a frequent appearance of an edge over time indicates a strong tendency of the edge to reappear in the future. This motivates us to reward such events by taking a cumulative sum. We define the Weighted Cumulative Link History as \(\mathbf {wclh}^{uv}_{[1,t-1]} = CumSum(\mathbf {wlh}^{uv}_{[1,t-1]})\).
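The link-history features are simple enough to spell out directly. The sketch below (a minimal NumPy illustration, assuming a 0/1 edge indicator per snapshot) computes \(\mathbf{wlh}\) and \(\mathbf{wclh}\) exactly as defined above.

```python
import numpy as np

def weighted_cumulative_link_history(link_history):
    """Compute wclh^{uv}_{[1,t-1]} from a 0/1 link-history vector.

    link_history[i] is 1 if edge (u, v) exists in snapshot G_{i+1}, else 0.
    """
    lh = np.asarray(link_history, dtype=float)
    T = len(lh)                                  # T = t - 1 training snapshots
    weights = np.arange(1, T + 1) / T            # time-decay weights w_i = i / (t-1)
    return np.cumsum(weights * lh)               # cumulative sum rewards repeated links

# edge observed in snapshots 1 and 3 out of 4 training snapshots
print(weighted_cumulative_link_history([1, 0, 1, 0]))   # -> [0.25 0.25 1.   1.  ]
```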
Finally, the feature vector of a node-pair (u, v), \(\mathbf {e}^{uv}_{[1,t-1]}\), is the concatenation of the GTE feature-set (\(\varvec{g}^{uv}_{[1,t-1]}\)) and the LH feature-set (\(\mathbf {wclh}^{uv}_{[1,t-1]}\)); i.e., \(\mathbf {e}^{uv}_{[1,t-1]} = \varvec{g}^{uv}_{[1,t-1]}~||~\mathbf {wclh}^{uv}_{[1,t-1]}\). For predicting dynamic links at time stamp \(t+1\), we right-shift the time window by one; in other words, we construct the feature representation \(\mathbf {e}^{uv}_{[2,t]}\) from the snapshots in time window [2, t]. The final feature representation for all node-pairs is obtained by stacking these row vectors: \(\mathbf {\hat{E}} = \big[\mathbf {e}^{uv}_{[1,t-1]}\big]_{(u,v)}\) and \(\mathbf {\bar{E}} = \big[\mathbf {e}^{uv}_{[2,t]}\big]_{(u,v)}\), with one row per node-pair.
Here, \(\mathbf {\hat{E}}\) is the training dataset and \(\mathbf {\bar{E}}\) is the prediction dataset. Both \(\mathbf {\hat{E}}\) and \(\mathbf {\bar{E}}\) can be represented as matrices of dimension (m, k), where m is the number of node-pairs. The size of the feature vector is \(k=|\mathbf {e}^{uv}_{[1,t-1]}|=c\cdot (t-1)+(t-1)\), where c is the total number of distinct GTEs that we consider.
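As a quick sanity check on the dimension k, a node-pair row of \(\mathbf {\hat{E}}\) can be assembled as follows (a sketch; the per-snapshot vectors would come from the GTE construction above, and the values of c and t are purely illustrative).

```python
import numpy as np

def node_pair_row(gte_vectors, wclh):
    """e^{uv} = g_1^{uv} || ... || g_{t-1}^{uv} || wclh^{uv} (one row of E-hat)."""
    return np.concatenate(gte_vectors + [wclh])

c, t = 30, 5                                    # c distinct GTE types, t snapshots (illustrative)
row = node_pair_row([np.zeros(c)] * (t - 1), np.zeros(t - 1))
assert len(row) == c * (t - 1) + (t - 1)        # k = c*(t-1) + (t-1)
```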
GTE enumeration. We compute GTEs by using a local growth algorithm. For computing \(\varvec{g}_{i}^{uv}\), we first enumerate all graphlets of \(G_i\) containing both u and v. Starting from the edge graphlet \(gl=\{u,v \}\), in each iteration of growth we add a new vertex w from the immediate neighborhood of gl to obtain a larger graphlet \(gl=gl\cup \{w\}\); growth terminates when \(|gl|=5\). The enumeration process is identical to our earlier work [17]. After enumeration, the GTE is easily obtained from each enumerated graphlet by marking the edge (u, v) as the transition trigger (see Fig. 4). The computations of GTEs for different node-pairs do not depend on each other, which makes the GTE enumeration task embarrassingly parallel.
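The local-growth idea can be sketched as follows (a simplified NetworkX illustration, not the authors' parallel implementation; it only collects the vertex sets of the graphlets around a prospective edge, from which the current and transformed graphlet types would then be identified).

```python
import networkx as nx

def graphlets_containing_edge(G, u, v, max_size=5):
    """Vertex sets of all graphlets of G containing both u and v, grown
    locally from the seed {u, v} up to max_size vertices."""
    seed = frozenset((u, v))
    found, frontier = {seed}, [seed]
    while frontier:
        gl = frontier.pop()
        if len(gl) == max_size:
            continue
        # immediate neighborhood of the current graphlet gl
        neighborhood = {w for x in gl for w in G.neighbors(x)} - gl
        for w in neighborhood:
            bigger = gl | {w}
            if bigger not in found:
                found.add(bigger)
                frontier.append(bigger)
    return found

# toy snapshot: graphlets around the prospective edge (3, 6)
G1 = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 6), (2, 6)])
for nodes in sorted(graphlets_containing_edge(G1, 3, 6), key=len):
    print(sorted(nodes))
```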
4.2 Unsupervised Feature Extraction
GraTFEL performs the task of unsupervised feature extraction by learning an optimal coding function h. Let us consider a feature vector \(\varvec{e}\) from either \(\mathbf {\hat{E}}\) or \(\mathbf {\bar{E}}\) (\(\varvec{e} \in \mathbf {\hat{E}}\cup \mathbf {\bar{E}}\)). The coding function h compresses \(\varvec{e}\) to a code vector \(\varvec{\alpha }\) of dimension l, such that \(l < k\). Here, l is a user-defined parameter representing the code length and k is the size of the feature vector. Many different coding functions exist in the dimensionality reduction literature, but GraTFEL chooses the coding function that incurs the minimum loss, in the sense that from the code \(\varvec{\alpha }\) we can reconstruct \(\varvec{e}\) with the minimum error over all possible \(\varvec{e} \in \mathbf {\hat{E}}\cup \mathbf {\bar{E}}\). We frame the learning of h as an optimization problem, which we discuss below through two operations: compression and reconstruction.
Compression: It obtains \(\varvec{\alpha }\) from \(\varvec{e}\). This transformation can be expressed as a nonlinear function of a linear weighted sum of the graphlet transition features.
$$\begin{aligned} \varvec{\alpha } = f(\varvec{W}^{(c)}\varvec{e} + \varvec{b}^{(c)}) \end{aligned}$$
(2)
\(\varvec{W}^{(c)}\) is an (l, k) dimensional matrix. It represents the weight matrix for compression, and \(\varvec{b}^{(c)}\) represents the biases. \(f(\cdot )\) is the sigmoid function, \(f(x)=\frac{1}{1+e^{-x}}\).
Reconstruction: It performs the reverse operation of compression, i.e., it reconstructs an approximation \(\varvec{\beta }\) of the graphlet transition features \(\varvec{e}\) from the code \(\varvec{\alpha }\) produced by the compression operation.
$$\begin{aligned} \varvec{\beta } = f(\varvec{W}^{(r)}\varvec{\alpha } + \varvec{b}^{(r)}) \end{aligned}$$
(3)
\(\varvec{W}^{(r)}\) is a matrix of dimension (k, l) representing the weight matrix for reconstruction, and \(\varvec{b}^{(r)}\) represents the biases.
The optimal coding function h constituted by the compression and reconstruction operations is defined by the parameters \((\varvec{W},\varvec{b}) = (\varvec{W}^{(c)},\varvec{b}^{(c)},\varvec{W}^{(r)},\varvec{b}^{(r)})\). The objective is to minimize the reconstruction error. The reconstruction error for a graphlet transition feature vector \(\varvec{e}\) is defined as \(J(\varvec{W},\varvec{b},\varvec{e}) = \frac{1}{2}\parallel \varvec{\beta }-\varvec{e} \parallel ^2\). Over all possible feature vectors, the average reconstruction error augmented with a regularization term yields the final objective function \(J(\varvec{W},\varvec{b})\), which we want to minimize:
$$\begin{aligned} {\begin{matrix} J(\varvec{W},\varvec{b}) = \frac{1}{2m} \sum _{\varvec{e} \in \mathbf {\hat{E}}\cup \mathbf {\bar{E}}}(\frac{1}{2}\parallel \varvec{\beta }-\varvec{e} \parallel ^2) +\frac{\lambda }{2} (\parallel \varvec{W}^{(c)}\parallel _F^2 + \parallel \varvec{W}^{(r)}\parallel _F^2 ) \end{matrix}} \end{aligned}$$
(4)
Here, \(\lambda \) is a user assigned regularization parameter, responsible for preventing over-fitting. \(\parallel \cdot \parallel _F\) represents the Frobenius norm of a matrix.
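A compact NumPy sketch of Eqs. (2)-(4) is given below (illustrative only; it evaluates the objective for given parameters but does not perform the optimization, which is discussed next).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compress(e, W_c, b_c):
    """alpha = f(W^(c) e + b^(c)), Eq. (2); W_c has shape (l, k)."""
    return sigmoid(W_c @ e + b_c)

def reconstruct(alpha, W_r, b_r):
    """beta = f(W^(r) alpha + b^(r)), Eq. (3); W_r has shape (k, l)."""
    return sigmoid(W_r @ alpha + b_r)

def objective(E, W_c, b_c, W_r, b_r, lam):
    """J(W, b) of Eq. (4): average reconstruction error plus Frobenius-norm penalty."""
    m = E.shape[0]
    err = sum(0.5 * np.sum((reconstruct(compress(e, W_c, b_c), W_r, b_r) - e) ** 2)
              for e in E)                       # rows of E-hat union E-bar
    return err / (2 * m) + 0.5 * lam * (np.sum(W_c ** 2) + np.sum(W_r ** 2))
```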
Training: The training of the optimal coding defined by parameters \((\varvec{W,b})\) begins with random initialization of the parameters. Since the cost function \(J(\varvec{W},\varvec{b})\) defined in Eq. (4) is non-convex, we obtain a locally optimal solution using the gradient descent approach. The parameter update of the gradient descent is similar to the parameter update of an auto-encoder in machine learning. The unsupervised feature representation of any node-pair (u, v) is obtained by taking the output of the compression stage (Eq. (2)) of the trained optimal coding (\(\varvec{W,b}\)).
$$\begin{aligned} \varvec{\alpha }_{[1,t-1]}^{uv} = f(\varvec{W}^{(c)}\varvec{e}_{[1,t-1]}^{uv} + \varvec{b}^{(c)})=h(\varvec{e}_{[1,t-1]}^{uv}) \end{aligned}$$
$$\begin{aligned} \varvec{\alpha }_{[2,t]}^{uv} = f(\varvec{W}^{(c)}\varvec{e}_{[2,t]}^{uv} + \varvec{b}^{(c)})=h(\varvec{e}_{[2,t]}^{uv}) \end{aligned}$$
Computational Cost: We use a Matlab implementation of the optimization algorithm L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) for learning the optimal coding. Because the cost function is non-convex, the algorithm converges only to a local optimum. We execute the algorithm for a limited number of iterations to obtain the unsupervised features within a reasonable period of time. Each iteration of L-BFGS executes two tasks for each edge: back-propagation to compute the partial derivatives of the cost function, and an update of the parameters \(\varvec{(W,b)}\). For each edge the time complexity is O(kl), where k is the length of the basic feature representation and l is the length of the unsupervised feature representation. Therefore, the time complexity of one iteration is O(mkl), where m is the number of possible edges.
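For reference, the same optimization can be sketched with SciPy's L-BFGS-B routine (the paper uses a Matlab L-BFGS implementation with back-propagated gradients; the version below relies on SciPy's finite-difference gradient approximation, so it is only practical for very small k and l, and it reuses the `objective` function from the previous sketch).

```python
import numpy as np
from scipy.optimize import minimize

def unpack(theta, k, l):
    """Split a flat parameter vector into (W_c, b_c, W_r, b_r)."""
    i = 0
    W_c = theta[i:i + l * k].reshape(l, k); i += l * k
    b_c = theta[i:i + l];                   i += l
    W_r = theta[i:i + k * l].reshape(k, l); i += k * l
    b_r = theta[i:i + k]
    return W_c, b_c, W_r, b_r

def fit_coding(E, l, lam=1e-4, max_iter=100):
    """Learn (W, b) by minimizing J(W, b) with L-BFGS-B; gradients are
    approximated numerically here, whereas the paper back-propagates them."""
    m, k = E.shape
    theta0 = 0.01 * np.random.randn(2 * k * l + k + l)   # random initialization
    cost = lambda th: objective(E, *unpack(th, k, l), lam)
    res = minimize(cost, theta0, method="L-BFGS-B", options={"maxiter": max_iter})
    return unpack(res.x, k, l)
```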
4.3 Supervised Link Prediction Model
The training dataset \(\mathbf {\hat{E}}\) is the feature representation for the time window \([1,t-1]\), and the ground truth (\(\hat{\varvec{y}}\)) is constructed from \(G_t\). After training the supervised classification model using \(\hat{\varvec{\alpha }}=h(\hat{\mathbf {E}})\) and \(\hat{\varvec{y}}\), the prediction dataset \(\mathbf {\bar{E}}\) is used to predict links at \(G_{t+1}\). For this supervised prediction task, we experiment with several classification algorithms; among them, SVM (support vector machine) and AdaBoost perform the best.
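A minimal scikit-learn sketch of this step, under the assumption that the unsupervised feature matrices and training labels are already available:

```python
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

def train_and_score(alpha_train, y_train, alpha_pred, use_svm=True):
    """Fit a classifier on the unsupervised features and score candidate links."""
    clf = SVC(probability=True) if use_svm else AdaBoostClassifier()
    clf.fit(alpha_train, y_train)
    return clf.predict_proba(alpha_pred)[:, 1]   # link scores for time stamp t+1
```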
4.4 Algorithm
The pseudo-code of GraTFEL is given in Algorithm 1. For training the link prediction model, we split the available network snapshots into two overlapping time windows, \([1,t-1]\) and [2, t]. The GTE features \(\hat{\varvec{E}}\) and \(\bar{\varvec{E}}\) are constructed in Lines 2 and 4, respectively. Then we learn the optimal coding for node-pairs using the graphlet transition features \((\hat{\mathbf {E}} \cup \bar{\mathbf {E}})\) in Line 5. Unsupervised feature representations are constructed using the learned optimal coding (Lines 6 and 7). Finally, a classification model C is learned (Line 8), which is used for predicting links in \(G_{t+1}\) (Line 9).
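For orientation, the whole pipeline can be sketched as below (illustrative only: `build_features` and `labels_from` are hypothetical placeholders for the feature construction of Sect. 4.1, while `sigmoid` and `fit_coding` refer to the earlier sketches).

```python
import numpy as np

def gratfel(snapshots, node_pairs, l, classifier):
    """End-to-end sketch mirroring Algorithm 1.

    snapshots  : [G_1, ..., G_t]
    node_pairs : candidate (u, v) pairs to score
    """
    t = len(snapshots)
    # Lines 2-4: GTE + link-history features for windows [1, t-1] and [2, t]
    E_hat = build_features(snapshots[:t - 1], node_pairs)    # hypothetical helper
    E_bar = build_features(snapshots[1:], node_pairs)        # hypothetical helper
    y_hat = labels_from(snapshots[t - 1], node_pairs)        # ground truth from G_t
    # Line 5: learn the optimal coding on E-hat union E-bar
    W_c, b_c, W_r, b_r = fit_coding(np.vstack([E_hat, E_bar]), l)
    # Lines 6-7: unsupervised feature representations (compression stage only)
    alpha_hat = sigmoid(E_hat @ W_c.T + b_c)
    alpha_bar = sigmoid(E_bar @ W_c.T + b_c)
    # Lines 8-9: train classification model C and score links of G_{t+1}
    classifier.fit(alpha_hat, y_hat)
    return classifier.predict_proba(alpha_bar)[:, 1]
```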