Supervised temporal link prediction in large-scale real-world networks

Link prediction is a well-studied technique for inferring the missing edges between two nodes in some static representation of a network. In modern day social networks, the timestamps associated with each link can be used to predict future links between so-far unconnected nodes. In these so-called temporal networks, we speak of temporal link prediction. This paper presents a systematic investigation of supervised temporal link prediction on 26 temporal, structurally diverse, real-world networks ranging from thousands to a million nodes and links. We analyse the relation between global structural properties of each network and the obtained temporal link prediction performance, employing a set of well-established topological features commonly used in the link prediction literature. We report on four contributions. First, using temporal information, an improvement of prediction performance is observed. Second, our experiments show that degree disassortative networks perform better in temporal link prediction than assortative networks. Third, we present a new approach to investigate the distinction between networks modelling discrete events and networks modelling persistent relations. Unlike earlier work, our approach utilises information on all past events in a systematic way, resulting in substantially higher link prediction performance. Fourth, we report on the influence of the temporal activity of the node or the edge on the link prediction performance, and show that the performance differs depending on the considered network type. In the studied information networks, temporal information on the node appears most important. The findings in this paper demonstrate how link prediction can effectively be improved in temporal networks, explicitly taking into account the type of connectivity modelled by the temporal edge. More generally, the findings contribute to a better understanding of the mechanisms behind the evolution of networks.


Introduction and problem statement
Link prediction is a frequently employed method within the broader field of social network analysis (Barabási 2016). Many important real-world applications exist in a variety of domains. Two examples are the prediction of (1) missing links between pages of Wikipedia and (2) which users are likely to be friends on an online social network (Kumar et al. 2020). Link prediction is often defined as the task of predicting missing links based on the currently observable links in a network (Linyuan and Zhou 2011). Many real-world networks have temporal information on the times when the edges were created (Divakaran and Mohan 2020). Such temporal networks are also called dynamic or evolving networks. They open up the possibility of doing temporal link prediction, that is, inferring future edges between two nodes as opposed to predicting only missing links (Liben-Nowell and Kleinberg 2007). For instance, in friendship networks, temporal link prediction might (1) facilitate friend recommendations and (2) predict which people will form new friendships in the future.
Existing work on temporal link prediction is typically performed on one or a handful of specific networks, making it difficult to assess the generalisability of the approaches used (Marjan et al. 2018). This paper presents the first large-scale empirical study of temporal link prediction on 26 different large-scale and structurally diverse temporal networks originating from various domains. In doing so, we provide a systematic investigation of how temporal information is best used in a temporal link prediction model.
The relevance of network structure can be illustrated briefly by the social networks used in this study. They have a higher density than the other networks. This might improve performance in temporal link prediction, because pairs of nodes in these networks are likely to have more common neighbours, providing more information to the supervised link prediction model. Thus, it is important to understand the relation between structural characteristics of the network and the performance of topological features.
A common approach in temporal link prediction is to employ a supervised machine learning model that utilises multiple features to classify which links are missing or, in case of temporal link prediction, will appear in the future (de Bruin et al. 2021). Features are typically computed for every pair of nodes that is not (yet) connected, based on the topology of the network (Kumar et al. 2020). These topological features essentially calculate a similarity score for a node pair, where a higher similarity signals a higher likelihood that this pair of nodes should be connected. Commonly used topological features, used in both supervised and unsupervised learning, include Common Neighbours, Jaccard Coefficient and Preferential Attachment (Sect. 4.1.1). These features clearly relate to the structural position of the nodes in the network. Previous work has suggested a straightforward approach to take the temporal evolution into account in these topological features (Tylenda et al. 2009; Bütün et al. 2018). We describe the process of obtaining the set of temporal topological features in Sect. 4.1.2. The benefits of using this set of features are that they are well-established and interpretable. Moreover, recent work has shown that in a supervised classifier these topological features perform as well as other types of features that are less interpretable and more complex (Ghasemian et al. 2020). A further comparison with other types of features is provided in Sect. 2.
In our work, we extend the set of state-of-the-art temporal topological features by considering that two types of temporal networks can be distinguished: networks with persistent relationships and networks with discrete events (O'Madadhain et al. 2005). The aforementioned example of friendship networks contains edges marking persistent relationships, which occur at most once for related persons. In case of communication networks, an edge usually marks a discrete event at an associated timestamp, representing a message sent from one person to another. In contrast to networks with persistent relationships, multiple edges can occur between two persons in discrete event networks. Previous studies have ignored that not all links are of the same type. In our approach, we address this gap in the literature by means of what we coin past event aggregation. This allows us to take both types of temporal links into account, where all information on two-faceted past interactions (i.e. persistent and discrete) is incorporated into the temporal topological features.
Last but not least, the temporal topological features implicitly assume so-called edge-centred temporal behaviour, suggesting that phenomena at the level of links determine the evolution of the network. Here, we may challenge the usual assumption that the temporal aspect is merely caused by the activity of nodes, being the decision-making entities in the network, and operating somewhat independently of the structure of the remainder of the network (Hiraoka et al. 2020). To investigate whether this assumption holds, we present a comparison between (1) temporal topological features and (2) features consisting of static topological features along with features capturing temporal node activity. By testing this distinction on the 26 different temporal networks, we can better understand whether the temporal aspect is best captured by considering edge-centred or node-centred temporal information.
To sum up, the four contributions of this work are as follows. First, to the best of our knowledge, we are the first to present a large-scale empirical study of temporal link prediction on a variety of networks. In total, we assess the performance of a temporal link prediction model on 26 structurally diverse networks, varying in size from a few hundred to over a million nodes and edges. Second, we analyse possible relations between structural network properties and the observed performance in temporal link prediction. We find that networks with degree disassortativity, signalling frequent connections between nodes with different degrees, show better performance in temporal link prediction. Third, we propose to account for all past interactions in discrete event networks. Fourth, in an attempt to understand the relation between node-centred and edge-centred temporal behaviour, we find that the information networks used in this study stand out, as they appear to have more node-centred temporal behaviour.
This work is structured as follows. In Sect. 2, we further elaborate on related work. Section 3 provides the notation used in this work, leading up to the definition of temporal link prediction. After that, we continue with the approach in Sect. 4. This is followed by a description of the temporal networks in Sect. 5. In Sect. 6, the four results of the experiments are presented and discussed. In Sect. 7, the conclusion is presented, together with suggestions for future work.

Related work
Although there is much literature available on link prediction, we found that attention for temporal networks and temporal link prediction is relatively limited. Some reviews have been published, pointing out the various approaches that exist towards temporal link prediction (Dhote et al. 2013; Divakaran and Mohan 2020). We start with an exploration of four types of approaches presented therein.
First, probabilistic models require (1) additional node or edge attributes to obtain sufficient performance (which hinders a generic approach to all networks) or (2) techniques that do not scale to larger networks (Kumar et al. 2020) (rendering them unusable for the larger networks used in the study).
Second, approaches such as matrix factorisation, spectral clustering (Romero et al. 2020), and deep learning approaches, like DeepWalk (Perozzi et al. 2014) and node2vec (Grover and Leskovec 2016), all try to find a lower dimensional representation of the temporal network and use the obtained representation as a basis for link prediction. These approaches all learn a representation of the network without requiring explicit engineering of features. However, the downside is that the obtained features are hard to interpret, thereby making it difficult to explain why a certain link is predicted to appear. In applied scenarios, under some jurisdictions, this explanation can be required by law when an employed machine learning model affects people, which is often the case in, for example, the health and law enforcement domains (Holzinger et al. 2017). As an example, in previous work we examined driving patterns of trucks in a so-called truck co-driving network, where trucks are connected when they frequently drive together (de Bruin et al. 2020). When an inspection agency would use gathered network information to predict which trucks should be inspected for possible misconduct, truck drivers may legally have the right to know why they were selected. Since we aim to provide approaches towards temporal link prediction that are applicable to any scientific domain, we disregard approaches that learn a lower dimensional representation.
Third, in the time series forecasting approach, the temporal network is divided into multiple snapshots (Potgieter et al. 2007; Da Silva Soares and Prudencio 2012; Öczan and Öğüdücü 2015; Güneş et al. 2016; Öczan and Öğüdücü 2017). For each of these snapshots, static topological features are learned. By using time series forecasting, the topological features of a future snapshot are learned, enabling link prediction. This approach does scale well to larger networks and is interpretable. However, it is unclear into how many snapshots the temporal network should be divided and whether the number of snapshots should remain constant across all networks used. This again hinders a truly generic approach.
Finally, we focus on temporal topological features in this work (Tylenda et al. 2009; Bütün et al. 2016). Recent work has suggested that the use of topological features in supervised learning can outperform more complex features learned from a lower dimensional representation of the temporal network (Ghasemian et al. 2020). Section 4 provides further details on this concept. These topological features are provided to a supervised link prediction classifier. Many different classification algorithms are known to work well in link prediction. Commonly used classifiers include logistic regression (Potgieter et al. 2007; O'Madadhain et al. 2005), support vector machines (Al Hasan et al. 2006; Öczan and Öğüdücü 2017), k-nearest neighbours (Al Hasan et al. 2006; Bütün et al. 2016, 2018), and random forests (Öczan and Öğüdücü 2017; Bütün et al. 2016, 2018; Ghasemian et al. 2020; de Bruin et al. 2020, 2021). We report performances using the logistic regression classifier. This classifier provides the following benefits: (1) it allows an intuitive explanation of how each instance is classified (Bishop 2013), (2) the classifier is relatively simple and hence interpretable (Molnar 2020), (3) the classifier scales well to larger networks, and (4) good results are achieved without any parameter optimisation (O'Madadhain et al. 2005).
To sum up, in contrast to earlier work on temporal link prediction, which has been applied to only a handful of networks (Bliss et al. 2014; Öczan and Öğüdücü 2015; Potgieter et al. 2007; Bütün et al. 2016, 2018; Da Silva Soares and Prudencio 2012; Güneş et al. 2016; Öczan and Öğüdücü 2017; Tylenda et al. 2009; O'Madadhain et al. 2005; Soares and Prudêncio 2013; Muniz et al. 2018; Romero et al. 2020), we apply link prediction on a structurally diverse set of 26 large-scale, real-world networks. We aim to do so using a generic, scalable and interpretable approach.

Preliminaries
This section starts by describing the notation used throughout this paper in Sect. 3.1. In Sect. 3.2, we explain the various network properties and measures used in this work. Finally, in Sect. 3.3 we formally describe the temporal link prediction problem.

Notation
An undirected, temporal network $H_{[t', t'']}(V, E_H)$ consists of a set of nodes $V$ and a set of edges $E_H$ (or, equivalently, links) that occur between timestamps $t'$ and $t''$. Networks with discrete events, where multiple events can occur between two nodes, can be seen as a multigraph, where multi-edges exist: links between the same two nodes, but with different timestamps (Gross et al. 2013). In this work, removal of edges is not considered, since this information is not available for most temporal networks.
A static representation of the underlying network is needed for the comparison between static and temporal features (see Sect. 4). This static, simple graph $G(V, E_G)$ is obtained from the temporal network $H_{[t', t'']}$ by considering all edges that occur between $t'$ and $t''$, collapsing the multi-edges into a single edge. The number of nodes (also called the size) of the graph is $n = |V|$ and the number of edges is $m = |E_G|$. For convenience in later definitions, $\Gamma(u)$ is the set of all neighbours of node $u \in V$. The size of this set, i.e. $|\Gamma(u)|$, is the degree of node $u$.
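The collapse from a temporal multigraph to a static simple graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `to_static_graph`, the `(u, v, t)` edge-tuple layout, and the dropping of self-loops are assumptions made for the example.

```python
from collections import defaultdict


def to_static_graph(temporal_edges):
    """Collapse timestamped (multi-)edges (u, v, t) into a simple undirected graph.

    Returns a dict that maps every node to the set of its neighbours.
    """
    neighbours = defaultdict(set)
    for u, v, _t in temporal_edges:
        if u == v:
            continue  # assumption: self-loops are dropped in the simple graph
        neighbours[u].add(v)
        neighbours[v].add(u)
    return dict(neighbours)


# Hypothetical toy discrete-event network: a and b interact twice.
events = [("a", "b", 1), ("a", "b", 5), ("b", "c", 7)]
G = to_static_graph(events)

n = len(G)                                      # number of nodes
m = sum(len(nb) for nb in G.values()) // 2      # number of (collapsed) edges
degree_b = len(G["b"])                          # size of the neighbour set of b
```

In this toy example the two events between a and b count as a single static edge, so the static graph has three nodes and two edges.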

Real-world network properties and their measures
Several properties exist that characterise the global structure of a network (Barabási 2016). These properties guide us in the exploration of how structure relates to the performance of a temporal link prediction algorithm. Below we discuss four of the main properties. Each of the properties is defined on the static underlying graph.
• Density The density of a network is calculated by dividing the number of edges by the total number of possible edges, i.e. $2m / (n(n - 1))$. For networks of the same size, higher density means that the average degree of nodes is higher, which has implications for the overall structural information available to the link prediction classifier.
• Diameter The diameter, sometimes called the maximum distance, is the largest distance observed between any pair of nodes. This property, together with density, captures how well-connected a network is.
• Average clustering coefficient The average clustering coefficient is the overall probability that two neighbours of a randomly selected node are linked to each other. The average clustering coefficient is given by
$$C = \frac{1}{n} \sum_{u \in V} \frac{2 L_u}{|\Gamma(u)| \, (|\Gamma(u)| - 1)},$$
where $L_u$ represents the number of edges between the neighbours of node $u$ and $|\Gamma(u)|$ is the degree of node $u$. Real-world networks, and social networks in particular, are often highly clustered.
• Degree assortativity It is often observed that nodes do not connect to random other nodes, but instead connect to nodes that are similar in some way. For instance, in social networks degree assortativity is observed, meaning that nodes often connect to other nodes with a similar degree. We can measure the degree assortativity of a network by calculating the Pearson correlation coefficient between the degrees of the nodes at both ends of all edges (Newman 2002). In case low degree nodes more frequently connect with high degree nodes, the obtained value is negative.
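The four properties can be illustrated with a small, dependency-free sketch on a graph stored as a dict of neighbour sets. The function names are hypothetical, the graph is assumed to be connected and to contain at least two nodes, and nodes with degree below two are assumed to contribute zero to the average clustering coefficient. The brute-force diameter runs one breadth-first search per node and is meant for small graphs only.

```python
from collections import deque
from math import sqrt


def density(adj):
    """Fraction of possible edges that are present: 2m / (n(n - 1))."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def diameter(adj):
    """Largest shortest-path distance, using one BFS per node (connected graph)."""
    largest = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        largest = max(largest, max(dist.values()))
    return largest


def average_clustering(adj):
    """Mean over all nodes of 2 L_u / (k_u (k_u - 1)); degree < 2 counts as 0."""
    total = 0.0
    for u, nb in adj.items():
        k = len(nb)
        if k < 2:
            continue
        # Every edge among the neighbours is seen from both endpoints, so halve.
        links = sum(len(adj[v] & nb) for v in nb) / 2
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def degree_assortativity(adj):
    """Pearson correlation between the degrees at both ends of every edge."""
    # Each undirected edge is listed in both directions, which makes it symmetric.
    xs = [len(adj[u]) for u in adj for v in adj[u]]
    ys = [len(adj[v]) for u in adj for v in adj[u]]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when all nodes have the same degree
    return cov / sqrt(var_x * var_y)


# Hypothetical toy graph: a star with hub "h" and three leaves.
star = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
```

A star is the extreme case of degree disassortativity: every edge joins the hub to a leaf, so the correlation between the degrees at both ends is perfectly negative.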

Temporal link prediction
The goal of a supervised link prediction model is to predict for unconnected pairs of nodes in the temporal network $H_{[t_{q=0}, t_{q=s}]}$ whether they will connect in an evolved interval $[t_{q=s}, t_{q=1}]$, where $q$ marks the $q$-th percentile of observed timestamps in the network and $0 < s < 1$. Hence, timestamps $t_{q=0}$ and $t_{q=1}$ mark the time associated with the first and last edge in the network, respectively. Moreover, timestamp $t_{q=s}$ marks the time used to split the network into two intervals. The examples provided to the supervised link prediction model are pairs of nodes that are not connected in $[t_{q=0}, t_{q=s}]$. For each example $(u, v)$ in the dataset, a feature vector $x_{(u,v)}$ and a binary label $y_{(u,v)}$ are provided to the supervised link prediction model. The label for each pair of nodes $(u, v)$ is $y_{(u,v)} = 1$ when it will connect in $[t_{q=s}, t_{q=1}]$ and $y_{(u,v)} = 0$ otherwise. Because parameter $s$ determines the number of nodes considered, it affects the class imbalance encountered in supervised link prediction: values of $s$ close to 1 result in a larger number of node pairs to consider, while limiting the number of positives.
The features used in the supervised link prediction model are only allowed to use information of network $H_{[t_{q=0}, t_{q=s}]}$, preventing any leakage from nodes that will connect in the evolved time interval $[t_{q=s}, t_{q=1}]$. Note that the temporal information contained in the network is used for two purposes: (1) it allows us to split the network into two temporal intervals and (2) it is used in feature engineering to model temporal evolution.
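A minimal sketch of the temporal split and the labelling of positives is given below. Several choices are assumptions made for illustration rather than details confirmed by the text: the function name `split_and_label`, placing edges with a timestamp equal to the split time in the first interval, and requiring both nodes of a positive pair to be present in the first interval so that features can be computed for them.

```python
def split_and_label(temporal_edges, s=2 / 3):
    """Split timestamped edges (u, v, t) at the s-quantile of the timestamps.

    Returns the edges of the first interval and the set of positive pairs:
    node pairs unconnected in the first interval that connect afterwards.
    """
    edges = sorted(temporal_edges, key=lambda e: e[2])
    # Assumption: the split time is the timestamp of the edge that closes the
    # first fraction s of the time-ordered edge list (no interpolation).
    n_past = max(1, min(len(edges), round(s * len(edges))))
    t_split = edges[n_past - 1][2]

    past = [e for e in edges if e[2] <= t_split]
    future = [e for e in edges if e[2] > t_split]

    past_pairs = {frozenset((u, v)) for u, v, _ in past}
    past_nodes = {x for u, v, _ in past for x in (u, v)}
    positives = {
        frozenset((u, v))
        for u, v, _ in future
        if u != v
        and u in past_nodes
        and v in past_nodes
        and frozenset((u, v)) not in past_pairs
    }
    return past, positives


# Hypothetical toy network with six timestamped edges.
events = [
    ("a", "b", 1), ("b", "c", 2), ("a", "b", 3),
    ("c", "d", 4), ("a", "c", 5), ("b", "d", 6),
]
past, positives = split_and_label(events)
```

Only `past` may be used for feature engineering; `positives` supplies the labels with value 1, and every other unconnected candidate pair receives label 0.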

Approach
This section explains the features used for supervised temporal link prediction. We start with an explanation of the different sets of features used in this study in Sect. 4.1. In particular, in step B of Sect. 4.1.2, we present a novel and intuitive approach to incorporate information on past interactions in the case of discrete event networks. In Sect. 4.2, we discuss the supervised link prediction model.

Features
In this subsection, we explain the three types of features used. First, in Sect. 4.1.1 static topological features are provided. Second, temporal topological features are given in Sect. 4.1.2. Third, the node activity features are specified in Sect. 4.1.3.

Static topological features
We use four common static topological features, which together form the feature vector for each candidate pair of nodes (u, v). These features are computed on the static graph G underlying the temporal network considered, as defined in Sect. 3.1. Below we define each of them.
Common Neighbours (CN) The CN feature is equal to the number of common neighbours of two nodes.
Adamic-Adar (AA) The AA feature considers all common neighbours, favouring common neighbours with low degrees (Adamic and Adar 2003).
Jaccard Coefficient (JC) The JC feature is similar to the CN feature, but normalises for the number of unique neighbours of the two nodes.
Preferential Attachment (PA) The PA feature takes into account the observation that nodes with a high degree are more likely to make new links than nodes with a lower degree.
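The four features can be sketched on a graph stored as a dict of neighbour sets. The definitions below are the standard forms from the link prediction literature; the function names and the use of the natural logarithm in AA are assumptions made for this illustration.

```python
from math import log


def common_neighbours(adj, u, v):
    """CN: number of neighbours shared by u and v."""
    return len(adj[u] & adj[v])


def adamic_adar(adj, u, v):
    """AA: sum over common neighbours z of 1 / log(degree of z).

    A common neighbour is linked to both u and v, so its degree is at
    least 2 and the logarithm is strictly positive.
    """
    return sum(1 / log(len(adj[z])) for z in adj[u] & adj[v])


def jaccard_coefficient(adj, u, v):
    """JC: shared neighbours divided by the unique neighbours of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def preferential_attachment(adj, u, v):
    """PA: product of the degrees of u and v."""
    return len(adj[u]) * len(adj[v])


# Hypothetical toy graph in which a and b are not (yet) connected.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
features = [
    common_neighbours(adj, "a", "b"),
    adamic_adar(adj, "a", "b"),
    jaccard_coefficient(adj, "a", "b"),
    preferential_attachment(adj, "a", "b"),
]
```

For the candidate pair (a, b), the two common neighbours c and d both have degree two, the union of the neighbourhoods contains three nodes, and the degrees are two and three.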

Temporal topological features
Straightforward temporal extensions to topological features have been proposed in the literature (Tylenda et al. 2009; Bütün et al. 2018). Our method extends these approaches to also capture past interactions in case of the aforementioned discrete event networks. The construction of these features requires three steps, namely:
A. Temporal weighting.
B. The proposed approach of past event aggregation.
C. Computation of weighted topological features.
The resulting feature vector for a given pair of nodes, after applying the three steps, consists of all possible combinations of 3 different temporal weighting functions (linear, exponential, square root), 8 different past event aggregations (see below under B) and 4 different weighted topological features (CN, AA, JC, PA).Thus, for discrete event networks the feature vector is of length 3 ⋅ 8 ⋅ 4 = 96 and for networks with persistent relationships it is of length 3 ⋅ 4 = 12 .
A: Temporal weighting The topological features need weighted edges (step C), while the networks used in this study have edges with an associated timestamp. In the temporal weighting step, we obtain these weights using a procedure described by Tylenda et al. (2009). The definitions of the temporal weighting functions are provided in Eqs. (5)-(7). In these definitions, a numeric timestamp $t$ is converted to a weight $w$. Note that $t_{\min}$ and $t_{\max}$ denote the earliest and latest observed timestamp over all edges of the considered network.
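To make the weighting step concrete, the sketch below gives one plausible form for each of the three functions. These forms are assumptions chosen to match the qualitative description in this section (output bounded between a lower bound l and 1, with the exponential function discounting older edges most and the square root function least); they are not a reproduction of Eqs. (5)-(7). The function names and the `rate` parameter are hypothetical, and the default l = 0.2 is the value used in the experiments.

```python
from math import exp, sqrt


def _scaled(t, t_min, t_max):
    """Rescale a timestamp to [0, 1]; assumes t_max > t_min."""
    return (t - t_min) / (t_max - t_min)


def w_linear(t, t_min, t_max, l=0.2):
    """Assumed linear weighting, rising from l at t_min to 1 at t_max."""
    return l + (1 - l) * _scaled(t, t_min, t_max)


def w_exponential(t, t_min, t_max, l=0.2, rate=3.0):
    """Assumed exponential weighting; `rate` is a hypothetical steepness."""
    x = _scaled(t, t_min, t_max)
    return l + (1 - l) * (exp(rate * x) - 1) / (exp(rate) - 1)


def w_square_root(t, t_min, t_max, l=0.2):
    """Assumed square root weighting, which keeps older edges relatively heavy."""
    return l + (1 - l) * sqrt(_scaled(t, t_min, t_max))


# Hypothetical timestamps: the network spans t = 0 .. 10, the edge is at t = 5.
weights_mid = (
    w_exponential(5, 0, 10),
    w_linear(5, 0, 10),
    w_square_root(5, 0, 10),
)
```

Under these assumed forms all three functions agree at the oldest and the newest timestamp, and differ only in how quickly the weight decays for edges in between.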
In Fig. 1 the behaviour of the different weighting functions is shown, when applied to the DBLP network (Ley 2002), which is further described in Sect. 5. The exponential weighting function (Eq. 6) assigns a higher weight to more recent edges than the linear (Eq. 5) and square root (Eq. 7) functions. In contrast, the square root function assigns higher weights to older edges in comparison to the linear and exponential functions. When weights of older edges become close to zero, these edges are discarded by the weighted topological features. To prevent edges far in the past from being discarded completely, we bound the output of each weighting function between a positive value l and 1.0 (l stands for lower bound).
B: Past event aggregation In case of networks with discrete events, each multi-edge has an associated weight after the previous temporal weighting step. To allow the weighted topological features to be computed, we need to obtain a single weight for each node pair, capturing their past activity. Here we propose to aggregate all past events using eight different aggregation functions. All eight functions use as input a set containing all the weights of past events. The following functions are used: (1) the zeroth, (2) first, (3) second, (4) third, (5) fourth quantile and the (6) sum, (7) mean, and (8) variance of all past weights. By means of these summary statistics, we aim to capture the fact that, depending on which network is considered, it may matter whether interaction took place very often, far away in the past, or very recently. These aggregation functions aim to capture different temporal behaviours. The quantile functions bin the set of weights, which is a common step in feature engineering. Taking the sum, mean, and variance of the set of weights allows the model to also capture more complex trends. An example of such a complex trend is the so-called bursty behaviour, which is often observed in real-world data (Barabási 2005).
C: Weighted topological features In Eqs. (8)-(11), the weighted topological features are presented, which are taken from Bütün et al. (2018). In these equations, wtf(u, v) denotes the weight obtained for a given pair of nodes (u, v) after edges have been temporally weighted and, in case of networks with discrete events, events have been aggregated.
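Steps B and C can be sketched together as follows. The sketch rests on stated assumptions: the zeroth to fourth quantile are read as the minimum, the three quartiles and the maximum, computed with linear interpolation; the variance is the population variance, so a single event yields zero; and the weighted Common Neighbours form shown is one common definition, not a reproduction of Eqs. (8)-(11). All function and variable names are hypothetical.

```python
from statistics import mean, pvariance


def quantile(sorted_weights, q):
    """Quantile with linear interpolation for q in [0, 1]."""
    pos = q * (len(sorted_weights) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_weights) - 1)
    return sorted_weights[lo] + (pos - lo) * (sorted_weights[hi] - sorted_weights[lo])


# Assumption: "zeroth to fourth quantile" means min, quartiles and max.
AGGREGATIONS = {
    "q0": lambda w: quantile(w, 0.00),
    "q1": lambda w: quantile(w, 0.25),
    "q2": lambda w: quantile(w, 0.50),
    "q3": lambda w: quantile(w, 0.75),
    "q4": lambda w: quantile(w, 1.00),
    "sum": sum,
    "mean": mean,
    "var": pvariance,  # assumption: population variance
}


def aggregate_events(weighted_events):
    """Step B: map events (u, v, w) to {aggregation name: {node pair: weight}}."""
    per_pair = {}
    for u, v, w in weighted_events:
        per_pair.setdefault(frozenset((u, v)), []).append(w)
    return {
        name: {pair: fn(sorted(ws)) for pair, ws in per_pair.items()}
        for name, fn in AGGREGATIONS.items()
    }


def weighted_common_neighbours(adj, wtf, u, v):
    """Step C (assumed form): sum over common neighbours z of wtf(u,z) + wtf(v,z)."""
    return sum(
        wtf[frozenset((u, z))] + wtf[frozenset((v, z))] for z in adj[u] & adj[v]
    )


# Hypothetical temporally weighted events: a and c interacted twice.
events = [("a", "c", 0.2), ("a", "c", 1.0), ("b", "c", 0.6)]
adj = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
aggregated = aggregate_events(events)
score = weighted_common_neighbours(adj, aggregated["sum"], "a", "b")
```

Each of the eight aggregations yields its own single-weight view of the network, on which every weighted topological feature is then computed once; this is what multiplies the feature count by eight for discrete event networks.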

Node activity features
The goal of the node activity features is to capture node-centred temporal activity. To this end, we create the node activity features in the following three steps: (1) temporal weighting, (2) aggregation of node activity, and (3) combining node activity. These steps are explained below. The feature vector for a given pair of nodes consists of all combinations of three different temporal weighting functions, seven different aggregation functions applied to the node activity, and four different combinations of the node activity. This results in a feature vector of length 3 ⋅ 7 ⋅ 4 = 84.
(1) Temporal weighting The temporal weighting procedure is the same as used in feature engineering of the temporal topological features (see Sect. 4.1.2).
(2) Aggregation of node activity For each node, the set of weights from all edges adjacent to the node under investigation is collected. To obtain a fixed feature vector for each node, the set of weights is aggregated using the following functions: (1) the zeroth, (2) first, (3) second, (4) third, (5) fourth quantile and the (6) sum and (7) mean of the node activity vector (here the variance of all node weights is omitted). Similar to the engineering of the temporal topological features, these aggregations are used to capture different kinds of activity that a node may exhibit. In particular, nodes are known to show bursty activity patterns in some networks (Hiraoka et al. 2020).
(3) Combining node activity To take the activity, obtained in the previous two steps, of both nodes under consideration into account, we use four different combination functions.These four functions are the (1) sum, (2) absolute difference, (3) minimum, and (4) maximum.
By doing this, we obtain the node activity feature vector.
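Steps (2) and (3) can be sketched for a single temporal weighting function, which gives 7 ⋅ 4 = 28 of the 84 features. The names are hypothetical, the quantiles are again assumed to be the minimum, the quartiles and the maximum with linear interpolation, and the input is assumed to be a list of already weighted edges (u, v, w).

```python
def quantile(sorted_weights, q):
    """Quantile with linear interpolation for q in [0, 1]."""
    pos = q * (len(sorted_weights) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_weights) - 1)
    return sorted_weights[lo] + (pos - lo) * (sorted_weights[hi] - sorted_weights[lo])


# Seven aggregations of the weights on the edges adjacent to a node.
AGGREGATIONS = {
    "q0": lambda w: quantile(w, 0.00),
    "q1": lambda w: quantile(w, 0.25),
    "q2": lambda w: quantile(w, 0.50),
    "q3": lambda w: quantile(w, 0.75),
    "q4": lambda w: quantile(w, 1.00),
    "sum": sum,
    "mean": lambda w: sum(w) / len(w),
}

# Four ways of combining the activity of the two nodes of a candidate pair.
COMBINATIONS = {
    "sum": lambda a, b: a + b,
    "absdiff": lambda a, b: abs(a - b),
    "min": min,
    "max": max,
}


def node_activity_features(weighted_edges, u, v):
    """Return {(aggregation, combination): value} for the node pair (u, v)."""
    activity = {}
    for a, b, w in weighted_edges:
        activity.setdefault(a, []).append(w)
        activity.setdefault(b, []).append(w)
    features = {}
    for agg_name, agg in AGGREGATIONS.items():
        a_u = agg(sorted(activity[u]))
        a_v = agg(sorted(activity[v]))
        for comb_name, comb in COMBINATIONS.items():
            features[(agg_name, comb_name)] = comb(a_u, a_v)
    return features


# Hypothetical weighted edges; a has two adjacent edges, b has one.
edges = [("a", "c", 0.2), ("a", "d", 1.0), ("b", "c", 0.6)]
features = node_activity_features(edges, "a", "b")
```

Repeating this for the three weighting functions and concatenating the results gives the full vector. Note that these features never look at the neighbourhood overlap of the two nodes, only at how active each node is on its own.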

Supervised link prediction
The features discussed in Sect. 4.1 serve as input for a supervised machine learning model that predicts whether or not a pair of currently disconnected nodes will connect in the future (see Sect. 3.3).
Here we use the logistic regression classifier. It was chosen because of its simplicity, overall good performance on this type of task and its explainability (see Sect. 2). We did not consider optimisation of any set of parameters, because that is considered outside the scope of the current paper.
In theory, a number quadratic in the number of nodes (i.e. the node pairs) could be selected as input for the classifier, with positive instances being node pairs that connect in the future, resulting in a significant class imbalance.To counter this problem and at the same time limit the computation time needed to train the model, we reduce the number of node pairs given as input to the classifier by the following two steps.
First, a well-known step is to select only pairs of nodes that are at a specific distance of each other (Lichtenwalter and Chawla 2012). Given the large sizes of networks used in this study, we limit the selection to only include pairs of nodes that are at distance two. Second, we sample with replacement from the remaining pairs of nodes a total of 10,000 that will connect (positives) and 10,000 that will not connect in the future (negatives). By doing this, we obtain a balanced set of examples which does not require any further post-processing, and can be used directly by the classifier. The training set for the logistic regression classifier is obtained by means of stratified sampling, taking 75% of all examples. The remaining instances are used as a test set. Because we do not optimise any parameters of the logistic regression classifier, no validation set is used.
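The two reduction steps can be sketched as follows, on a graph stored as a dict of neighbour sets. The function names, the fixed random seed and the toy input are assumptions for illustration; the sketch also assumes that at least one positive and one negative candidate exist.

```python
import random


def pairs_at_distance_two(adj):
    """All unconnected node pairs that share at least one neighbour."""
    pairs = set()
    for nb in adj.values():
        # Any two neighbours of the same node are at distance one or two.
        for u in nb:
            for v in nb:
                if u != v and v not in adj[u]:
                    pairs.add(frozenset((u, v)))
    return pairs


def balanced_sample(candidates, positives, k=10_000, seed=0):
    """Sample k positives and k negatives with replacement.

    `positives` is the set of pairs that connect in the future interval.
    Returns the sampled pairs and their binary labels.
    """
    rng = random.Random(seed)
    # Sorting makes the result reproducible for a given seed.
    pos = sorted((p for p in candidates if p in positives), key=sorted)
    neg = sorted((p for p in candidates if p not in positives), key=sorted)
    sample = rng.choices(pos, k=k) + rng.choices(neg, k=k)
    labels = [1] * k + [0] * k
    return sample, labels


# Hypothetical toy graph: the path a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
candidates = pairs_at_distance_two(adj)
sample, labels = balanced_sample(candidates, {frozenset(("a", "c"))}, k=5)
```

Restricting candidates to distance two keeps the number of pairs far below the quadratic worst case, and it also guarantees that every candidate has at least one common neighbour, so the neighbourhood-based features are never trivially zero.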
Analogously to previous work (Divakaran and Mohan 2020), we measure the performance of the classifier on the test set by means of the Area Under the Receiver Operating Characteristic Curve (AUC). The AUC only considers the ranking of each score obtained for each pair of nodes provided to the logistic regression classifier. The AUC does not consider the absolute value of the score. This makes the measured performance robust to cases where the applied threshold on the scores is chosen poorly. An AUC of 0.5 signals random behaviour, i.e. no classifier performance at all. A perfect performance is obtained when the AUC is equal to 1, which is highly unlikely in practical settings.
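That the AUC depends only on the ranking of the scores can be seen from its rank-sum formulation, sketched below without external dependencies. The function name is hypothetical; tied scores receive their average rank, which is the usual convention, and both classes are assumed to be present.

```python
def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation.

    Equals the probability that a random positive receives a higher score
    than a random negative, with ties counted as one half.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical labels and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auc(labels, scores)
```

Applying any strictly increasing transformation to the scores leaves the ranks, and therefore the AUC, unchanged, which is why no decision threshold needs to be chosen.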

Data
In this work, we use a structurally diverse and large collection of in total 26 temporal networks. The networks can be categorised into three different domains, namely social, information, and technological networks. The distinction of networks into these three domains is taken from other network repositories. In Table 1 some common structural properties of these datasets are presented (see Sect. 3.2 for definitions). It is apparent from Fig. 2, which shows the relation between the number of nodes and edges for each of the 26 datasets, that the selected networks span a broad range in terms of size. Also, for each network it is indicated whether the edges mark persistent relationships or discrete events. In the latter case, the network has a multigraph structure, which requires preprocessing as discussed in Sect. 4.1.2. We observe seventeen networks showing degree disassortative behaviour, meaning that high degree nodes tend to connect to low degree nodes more frequently. The other nine networks show the opposite behaviour. We do not observe any significant relation between the domain of a network and its degree assortativity, or any other global property of the network.
A total of 21 networks were obtained from the Konect repository (Kunegis 2013), four networks from SNAP (Leskovec and Krevl 2014) and one from AMiner (Zhuang et al. 2013). The last column in Table 1 provides a reference to the work where each network is first introduced. Any directed network is converted to an undirected network by ignoring the directionality. In originally signed networks, we use only positive edges.

Experiments
In Sect. 6.1 we start with the experimental setup. Then, the structure of this section follows the four contributions of this work. In Sect. 6.2 the performance of temporal link prediction on 26 networks is assessed. Section 6.3 continues with the analysis of the relation between structural network properties and the performance in temporal link prediction. In Sect. 6.4, we show the results of our methodological contribution to temporal link prediction in networks with discrete events. We finish in Sect. 6.5 with a comparison between node-centred and edge-centred temporal behaviour.

Experimental setup
In Sect. 3.3, the procedure to obtain examples and labels that serve as input for the classifier has been explained. In this procedure, we need to determine the value s for each network. Commonly, around two thirds of the edges are used for extraction of features (Lichtenwalter et al. 2010; Bütün et al. 2016, 2018; Al Hasan et al. 2006) and hence we choose s = 2/3. In the feature engineering of the temporal topological and node activity features, the first step is to temporally weight each edge. In Sect. 4.1.2, step A, parameter l is introduced to prevent the discarding of old edges in the temporal weighting procedure. Based on earlier work (Tylenda et al. 2009), we set l = 0.2, giving a minimal weight to links far away in the past, while still sufficiently discounting these older links.
In this work, we use four sets of features.These feature sets, indicated by capital Roman numerals, are as follows.
I Static topological (as defined in Sect. 4.1.1).
II-A Temporal topological (as defined in Sect. 4.1.2).
It is common practice to standardise features by subtracting the mean and scaling to unit variance. The logistic regression classifier provided in the Python scikit-learn package (Pedregosa et al. 2011) is used. Although the goal of this paper is not to extensively compare machine learning classifiers, Appendix A2 presents results on the performance in terms of AUC obtained using two other commonly used classifiers, namely random forests (Pedregosa et al. 2011) and XGBoost (Chen and Guestrin 2016). For almost all datasets, similar relative performance is observed.
The code used in this research is available at http://github.com/gerritjandebruin/snam2021-code. It uses the Python language and the packages NetworkX (Hagberg et al. 2008) for network analysis, scikit-learn (Pedregosa et al. 2011) for the machine learning pipeline, and the SciPy ecosystem (Virtanen et al. 2020) for some of the feature engineering and statistical tests. The C++ library teexGraph (Takes and Kosters 2011) was used to determine the diameter of each network. The package versions, as well as all dependencies, can be found in the aforementioned repository.

Improvement of prediction performance with temporal information
We examine whether temporal information improves the overall prediction performance. Baseline performance is obtained by ignoring temporal information, using only static topological features (feature set I). In contrast, temporal topological features (feature set II-A) are used to obtain the performance of link prediction utilising temporal information. The results of this comparison are presented in Table 2 and in Fig. 3. They clearly indicate that using temporal information improves the prediction performance of new links, i.e. performance reported in column 'II-A' is always higher than that in 'I'. So, every single network shows better performance when temporal topological features are used. The average improvement in performance is 0.07 ± 0.04 (± standard deviation). For some networks, performance improves considerably more when temporal information is used in prediction. For example, the loans network has a mediocre baseline performance of 0.79, but a high performance of 0.95 is observed when temporal information is employed. This improvement in performance can be related to the structure of the network. Hence, in the next section the relation between the structural properties of networks and the performance in temporal link prediction is explored.

Structural network properties and link prediction performance
In this section, we examine which structural properties are associated with high link prediction performance. In Fig. 4, the Pearson correlations between the performance in (temporal) link prediction and various structural network properties (see Sect. 3.2) are presented. While most properties show at best modest correlation with the link prediction performance, we observe a significant negative correlation between the degree assortativity of a network and the prediction performance of new links using static topological features (p = 3 ⋅ 10⁻⁶) and temporal topological features (p = 5 ⋅ 10⁻⁷). This means that networks with strong disassortative behaviour, where nodes of low degree are more likely to connect with nodes of high degree, show better performance in link prediction. The relation between degree assortativity and the link prediction performance is shown in more detail in Fig. 5. The observed negative correlation might be explained as follows. In real-world networks, low degree nodes typically largely outnumber the high degree nodes. However, nodes with a degree that by far exceeds the average degree, so-called hubs, are also relatively often observed in real-world networks (Barabási 2016). In degree disassortative networks, the numerous low degree nodes by definition connect more frequently with hubs than with other low degree nodes. For these low degree nodes, the preferential attachment feature will provide higher scores for candidate nodes having a high degree. Therefore, the supervised model can use this information in a straightforward manner to obtain a better performance.
To confirm the relation between the degree assortativity and temporal link prediction performance of a network, we conducted additional experiments. By performing assortative and disassortative degree-preserving rewiring, we further substantiate the claim that disassortative networks indeed show higher link prediction performance. Detailed results can be found in Appendix A1.
In Fig. 5 we observe that the temporal topological features show an even stronger correlation (correlation coefficient −0.82) than the static topological features (correlation coefficient −0.78). A possible explanation is that the temporal features are able to determine with higher accuracy which nodes will grow into active hubs, linking to many low degree nodes, whereas this information would be lost in a static network representation. This observation provides additional evidence that the temporal topological features are likely capturing relevant temporal behaviour.
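The two quantities central to this section can be illustrated on a toy network. The sketch below computes degree assortativity as the Pearson correlation between the degrees at both ends of each edge, and the preferential attachment score as the product of two node degrees. The toy network of two hubs with leaves and the function names are hypothetical; NetworkX provides `degree_assortativity_coefficient` for real networks.

```python
from collections import Counter

def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    deg = Counter(n for e in edges for n in e)
    # Count every undirected edge in both directions so the measure is symmetric.
    xs = [deg[u] for u, v in edges] + [deg[v] for u, v in edges]
    ys = [deg[v] for u, v in edges] + [deg[u] for u, v in edges]
    m = sum(xs) / len(xs)  # xs and ys have the same mean by symmetry
    cov = sum((x - m) * (y - m) for x, y in zip(xs, ys))
    var = sum((x - m) ** 2 for x in xs)
    return cov / var if var else float("nan")  # undefined if all degrees are equal

def preferential_attachment(deg, u, v):
    """Preferential attachment score of a candidate node pair: product of degrees."""
    return deg[u] * deg[v]

# Hypothetical disassortative toy network: hubs "h" and "g", each linked only to leaves.
edges = [("h", x) for x in "abcd"] + [("g", x) for x in "efi"]
deg = Counter(n for e in edges for n in e)
r = degree_assortativity(edges)              # strongly negative
pa_hub = preferential_attachment(deg, "a", "g")   # leaf with hub
pa_leaf = preferential_attachment(deg, "a", "b")  # leaf with leaf
```

For the leaf "a", the candidate hub "g" receives a higher preferential attachment score than the candidate leaf "b", which matches the linking pattern that dominates in a disassortative network.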

Enhancement of performance with past event aggregation
To assess how networks with discrete events should be dealt with in temporal link prediction, we use two different sets of features. The first set of features (II-A) is constructed with past event aggregation, which makes full use of the information contained in all discrete events. The second set of features (II-B) considers only the last occurring edge between two nodes, thereby ignoring any past events. For networks with persistent edges, the two sets of features yield the same results, because the networks do not contain past events. The performance obtained with these two different sets of features is reported in Table 2. In Fig. 6, we show the difference between the two performances of the networks with discrete events in more detail. From this figure, we learn that these networks all show better performance when past events are aggregated using the various aggregation functions. This result is interesting more broadly for link prediction research, as the derived feature modification steps can be inserted into any topological network feature aiming to capture the similarity of nodes in an attempt to predict their future connectivity. When looking in more detail at the performance improvement by past event aggregation for each discrete event network, we observe large differences. On the one hand, we observe networks with only minor improvement when past events are aggregated. For example, the Condensed matter (scientific) collaboration network (Condm.) shows only a minor improvement in AUC from 0.706 to 0.760. A possible explanation is that temporal information of discrete events has only limited use, since it takes time to come to a successful collaboration. On the other hand, the UC Irvine message network (UC) shows a major improvement in AUC from 0.744 to 0.893. This might be caused by the more variable nature of messages, which take only a short time to establish. In that case, the feature set with past event aggregation might provide higher scores to pairs of nodes that are both actively messaging.
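The difference between the two feature sets can be sketched as follows. Each event is a tuple (u, v, t, w) with timestamp t and temporal weight w. The set of aggregation functions (sum, mean, max), the function names and the toy events are assumptions for illustration; the aggregation functions used in the paper are the ones defined in the feature engineering section.

```python
from collections import defaultdict
from statistics import mean

# Assumed set of aggregation functions, for illustration only.
AGGREGATIONS = {"sum": sum, "mean": mean, "max": max}

def aggregate_events(events):
    """In the spirit of feature set II-A: aggregate the weights of all events per node pair."""
    per_pair = defaultdict(list)
    for u, v, t, w in events:
        per_pair[frozenset((u, v))].append(w)
    return {pair: {name: f(ws) for name, f in AGGREGATIONS.items()}
            for pair, ws in per_pair.items()}

def last_event_only(events):
    """In the spirit of feature set II-B: keep only the weight of the most recent event."""
    last = {}
    for u, v, t, w in events:
        pair = frozenset((u, v))
        if pair not in last or t > last[pair][0]:
            last[pair] = (t, w)
    return {pair: w for pair, (t, w) in last.items()}

# Hypothetical events: "a" and "b" interact three times, "b" and "c" only once.
events = [("a", "b", 1, 0.2), ("a", "b", 5, 0.6), ("b", "a", 9, 1.0), ("b", "c", 9, 1.0)]
agg = aggregate_events(events)
last = last_event_only(events)
ab, bc = frozenset("ab"), frozenset("bc")
```

In this toy example both pairs have the same last-event weight, so the last-event representation cannot tell them apart, whereas the aggregated sum separates the frequently interacting pair from the pair with a single event.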

Comparison of node- and edge-centred temporal link prediction
In the experiments performed so far, we used temporal topological features to assess temporal link prediction performance. These features assume edge-centred temporal behaviour. In the experiments below, we compare the performance of the edge-centred features (feature set II-A) with features that assume node-centred temporal behaviour (feature set III). The results of both feature sets are presented in Table 2 and in more detail in Fig. 7. We observe a correlation (correlation coefficient 0.92, p = 0.009) between the obtained performances using the two sets of features on all 26 networks. This finding suggests that the temporal aspect of most networks can be modelled by using either node-centred or edge-centred temporal features.
However, for the four information networks the performance of the node-centred features seems to be higher than that of the edge-centred features. This finding hints that in information networks temporal behaviour may be node-centred.
Given the low number of information networks available in this study, further research should be conducted on a larger set of information networks to verify this finding.
A note of caution is due here since we analyse the temporal link prediction performance only on pairs of nodes at a distance of two; different findings may be observed when more global features of node similarity are used. Notwithstanding this limitation, the study shows that both node- and edge-centred features in supervised temporal link prediction are able to achieve a high performance.

Conclusion and outlook
In this paper, the aim was to perform a large-scale empirical study of temporal link prediction, using a wide variety of structurally diverse networks. Moreover, we aimed to demonstrate the benefit of past event aggregation, which allows the rich interaction history of nodes to be taken into account in predicting their future linking activity. This study resulted in four findings. First, performance in supervised temporal link prediction is consistently higher when temporal information is taken into account. Second, the performance in temporal link prediction appears related to the global structure of the network. Most notably, degree disassortative networks perform better than degree assortative networks. Third, the newly proposed method of past event aggregation is able to better model link formation in networks with discrete events. It substantially increases the performance of temporal link prediction. The derived feature modification steps can be inserted into any topological feature, potentially improving the performance of any supervised (temporal) link prediction endeavour. Fourth, we showed that in four information networks, features capturing node activity, together with static topological features, outperform features that consider edge-centred temporal information, suggesting that the temporal mechanisms in these networks reside with the nodes.
A natural next step of this work is to analyse even bigger temporal networks, or networks originating from different domains. It appears that publicly available networks from other domains, such as biological, economic and transportation networks, typically do not contain temporal information (Ghasemian et al. 2020). However, it would be interesting to investigate whether findings presented in this paper also hold for these types of networks. In addition, it is evident that there is an advantage to taking temporal information into account when performing supervised link prediction on temporal networks. It could be interesting to see whether such temporal information also benefits prediction performance in other machine learning tasks on networks, such as node classification (Hamilton et al. 2017).

Appendix A1: Relation degree assortativity and temporal link prediction performance
To further assess the relation between degree assortativity and temporal link prediction performance, as derived from the empirical results in Sect. 6.3, we conducted additional experiments. By means of simulation, we modified a number of network datasets from Table 1 using assortative and disassortative degree-preserving rewiring, following an approach similar to the one proposed in Van Mieghem et al. (2010).
In particular, we aim to retain the local clustering properties by not selecting two edges at random, but rather selecting two edges that are close to each other. This ensures that not too many triangles, and therewith clustering, are destroyed, as clustering is a determining feature in link prediction.
The procedure, which we repeat a certain number of times (explained below), consists of the following five steps. First, an edge (u, v) is randomly selected. Second, we randomly select a node x from the neighbourhood of u. Third, we sample a node y that is connected to x, but not to u or v. At this time, pairs of nodes (u, v) and (x, y) are connected while the link (v, y) is absent. The fourth step is to determine from the pairs of nodes (u, v), (v, y) and (x, y) which node pair has a maximum difference in degree.
Step five is the actual rewiring of edges. There can be three outcomes from step four: (a) node pair (v, y) has the maximum difference in degree and there is no gain in assortativity by rewiring any edges, (b) node pair (u, v) has the maximum difference in degree and by moving all edges (recall, there can be multiple links between two nodes) between (u, v) to (v, y) the assortativity is increased, and (c) node pair (x, y) has the maximum difference in degree and by moving all edges between (x, y) to (v, y) the assortativity is increased. In case we want to perform disassortative degree-preserving rewiring, we consider in steps four and five the node pair with the lowest difference in degree. The five steps are repeated such that the number of edges that get a chance to rewire increases in increments of 0.2m, from the original network up to m.
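The five steps can be sketched as follows. This is a literal reading of the description above and not the code used for the experiments: the multigraph representation, the function names, the uniform choice among node pairs in step one and the tie-breaking in favour of no rewiring are all assumptions. The sketch preserves the total number of edges; any further bookkeeping required by the approach of Van Mieghem et al. (2010) is not reproduced here.

```python
import random
from collections import Counter, defaultdict

def build(edges):
    """Multigraph as node -> Counter(neighbour -> number of parallel edges)."""
    adj = defaultdict(Counter)
    for a, b in edges:
        adj[a][b] += 1
        adj[b][a] += 1
    return adj

def degree(adj, n):
    return sum(adj[n].values())

def sample_candidate(adj, rng):
    """Steps 1-3: edge (u, v), neighbour x of u, node y adjacent to x but not to u or v."""
    u, v = rng.choice([(a, b) for a in adj for b in adj[a]])
    xs = [x for x in adj[u] if x != v]
    if not xs:
        return None
    x = rng.choice(xs)
    ys = [y for y in adj[x] if y not in (u, v) and y not in adj[u] and y not in adj[v]]
    return (u, v, x, rng.choice(ys)) if ys else None

def pick_pair(adj, u, v, x, y, assortative=True):
    """Step 4: pair with the largest (disassortative case: smallest) degree difference."""
    diff = {p: abs(degree(adj, p[0]) - degree(adj, p[1])) for p in [(v, y), (u, v), (x, y)]}
    return (max if assortative else min)(diff, key=diff.get)  # ties favour (v, y)

def rewire_once(adj, rng, assortative=True):
    """Step 5: move all parallel edges of the picked pair to (v, y); True if rewired."""
    cand = sample_candidate(adj, rng)
    if cand is None:
        return False
    u, v, x, y = cand
    a, b = pick_pair(adj, u, v, x, y, assortative)
    if (a, b) == (v, y):
        return False  # outcome (a): no rewiring
    m = adj[a].pop(b)
    adj[b].pop(a)
    adj[v][y] += m
    adj[y][v] += m
    return True

# Hypothetical toy network; ten disassortative rewiring attempts with a fixed seed.
adj = build([(0, 1), (0, 2), (2, 3), (1, 4), (1, 5), (1, 6)])
rng = random.Random(0)
n_rewired = sum(rewire_once(adj, rng, assortative=False) for _ in range(10))
```

Repeating `rewire_once` for 0.2m, 0.4m, and so on up to m attempts would give the series of increasingly rewired networks described above.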
The degree assortativity values of the rewired networks can be found in Table 3. We observe that for degree disassortative rewiring, a higher performance is attained than for assortative rewiring, strengthening our result from Sect. 6.3. This finding is further explored in Table 4, in which we list the percentage increase in performance for both disassortatively and assortatively rewired datasets. In all cases, we observe higher performance for disassortatively rewired networks.

Appendix A2: Choice of classifier
As described in Sect. 2, many classifiers are known to work well in link prediction. We used the logistic regression classifier in this work, for reasons of interpretability and explainability, as further discussed in Sect. 2. In Table 5, we provide, for each of the datasets as introduced in Table 1, the performance in terms of AUC obtained using two other commonly used classifiers, being random forests (Pedregosa et al. 2011) and XGBoost (Chen and Guestrin 2016), with default parameters. For almost all datasets, similar relative performance is observed.

Fig. 1 The mapping of the three different weighting functions for the entire DBLP network

Fig. 2 The number of nodes and edges of the networks used in this paper. The horizontal and vertical axes have logarithmic scaling

Fig. 3 Performance of the link prediction classifier for the 26 different networks

Fig. 4 Correlations between network properties and the performance in a supervised classifier learned only with static topological features (feature set I) and with temporal topological features (feature set II-A)

Table 1 Networks used in this work. The following abbreviations are used in the columns: D.a., degree assortativity; A.c.c., average clustering coefficient; Diam., diameter. In the column 'domain', Technological is abbreviated to Tech. and Information to Inf.

Table 4 Performance (in AUC) for the rewired networks as reported on in Table 3