Knowledge distillation on neural networks for evolving graphs

Graph representation learning on dynamic graphs has become an important task in several real-world applications, such as recommender systems, email spam detection, and so on. To efficiently capture the evolution of a graph, representation learning approaches employ deep neural networks with a large number of parameters to train. Due to the large model size, such approaches have high online inference latency. As a consequence, such models are challenging to deploy in an industrial setting with a vast number of users/nodes. In this study, we propose DynGKD, a distillation strategy to transfer the knowledge from a large teacher model to a small student model with low inference latency, while achieving high prediction accuracy. We first study different distillation loss functions to separately train the student model with various types of information from the teacher model. In addition, we propose a hybrid distillation strategy for evolving graph representation learning to combine the teacher's different types of information. Our experiments with five publicly available datasets demonstrate the superiority of our proposed model against several baselines, with an average relative drop of 40.60% in terms of RMSE in the link prediction task. Moreover, our DynGKD model achieves a compression ratio of 21:100, accelerating the inference latency with a speed-up factor of ×30 when compared with the teacher model.
For reproduction purposes, we make our datasets and implementation publicly available at https://github.com/stefanosantaris/DynGKD.


Introduction
Graph representation learning has been at the core of several machine learning tasks on graphs, such as node classification (Zhang et al. 2019; Qu et al. 2019; Kipf and Welling 2017) and link prediction (Kumar et al. 2019; Zhang and Chen 2018). The main objective is to learn low-dimensional node embeddings, so that the structure of the graph is reflected in the embedding space. Early approaches work primarily on static graphs (Grover and Leskovec 2016; Veličković et al. 2018; Perozzi et al. 2014). However, most real-world applications are dynamic, where graphs evolve over time. Recently, dynamic approaches have been proposed to capture both the topological and temporal properties of evolving graphs in the node embeddings (Sankar et al. 2020; Nguyen et al. 2018; Pareja et al. 2020). Such approaches have demonstrated remarkable performance on various applications, such as email spam detection (Akoglu et al. 2015), recommender systems (Cao et al. 2019; Ying et al. 2018), name disambiguation in citation networks (Zhang et al. 2018), molecular generation (You et al. 2018; Bresson and Laurent 2019), and so on.
Learning dynamic embeddings that preserve the time-varying structure and node interactions of an evolving graph is a fundamental problem. Existing representation strategies apply several techniques among consecutive graph snapshots to learn accurate node embeddings, such as recurrent neural networks (Pareja et al. 2020), attention mechanisms (Sankar et al. 2020), and temporal regularizers (Li et al. 2017). To preserve the structural and temporal properties of evolving graphs without loss of information, such strategies design neural network architectures with a large number of model parameters. Given the large model size, existing strategies present high online inference latency. Despite their remarkable achievements in producing accurate node embeddings, these strategies are not suitable for real-world applications with near real-time requirements on inference latency. For example, distributed live video streaming solutions in large enterprises exploit the offices' internal high-bandwidth network to distribute the video content among viewers/nodes (Roverso et al. 2015). To select the high-bandwidth connections, distributed solutions exploit graph neural networks to predict the throughput among viewers in real time (Antaris and Rafailidis 2020b). Based on the predicted throughput, each viewer adapts the connections to efficiently distribute the video content. However, the high online inference latency of the large models negatively impacts the distribution of the video stream in an enterprise network (Antaris et al. 2020).
To reduce the online inference latency, knowledge distillation has been recently introduced to generate compact models without any information loss (Hinton et al. 2015; Bucilua et al. 2006). In particular, knowledge distillation trains a cumbersome model, namely the teacher, without stringent requirements on inference latency. Therefore, the teacher model is trained as an offline process and employs neural networks with large model sizes. Once the teacher model is trained, the knowledge can be distilled to a compact model, namely the student, through a well-designed distillation loss function. Therefore, the student model has a significantly lower number of model parameters than the teacher model, while preserving high prediction accuracy. Given the low online inference latency due to the small model size, the student model can be deployed to online applications (Phuong and Lampert 2019; Tang and Wang 2018; Asif et al. 2020; Qian et al. 2020; Kim and Rush 2016).
Existing knowledge distillation strategies fall into two main categories: (1) feature-based strategies that exploit the generated features of either the last layer or the intermediate layers as a supervision signal to train the student model (Romero et al. 2015; Huang and Wang 2017; Gou et al. 2021), and (2) response-based strategies where the distillation loss function minimizes the differences between the predicted values of the teacher and student models (Chen et al. 2017; Meng et al. 2019). Recently, the impact of knowledge distillation has been studied for static graph representation learning strategies (Yang et al. 2020; Chen et al. 2020; Ma and Mei 2019). These representation strategies extract the structural knowledge of a static graph and perform distillation to transfer the acquired knowledge to a compact student model. However, such approaches ignore the temporal aspect of evolving networks. A recent attempt to employ knowledge distillation on a dynamic graph representation learning strategy was presented in our preliminary study of the Distill2Vec model (Antaris and Rafailidis 2020a). Distill2Vec trains a teacher model on the offline graph snapshots and transfers the knowledge to the student model when learning on the online data. Distill2Vec employs a feature-based distillation strategy, adopting the Kullback-Leibler divergence between the teacher and student node embeddings in the distillation loss function. However, Distill2Vec focuses only on the final node embeddings of the teacher model, ignoring the prediction accuracy of the teacher model and the additional information captured in the intermediate features/layers.
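The two families of distillation losses described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the function names, the temperature value, and the use of mean squared error for the feature term are our own choices for exposition.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; a higher T produces softer targets.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def response_distillation_loss(student_logits, teacher_logits, T=2.0):
    # Response-based KD (Hinton et al. 2015): KL divergence between the
    # temperature-softened teacher and student output distributions.
    p = softmax(teacher_logits, T)   # soft labels from the teacher
    q = softmax(student_logits, T)
    return float(np.sum(p * np.log(p / q)))

def feature_distillation_loss(student_feat, teacher_feat):
    # Feature-based KD (in the spirit of FitNet): match intermediate
    # representations of teacher and student directly.
    s, t = np.asarray(student_feat), np.asarray(teacher_feat)
    return float(np.mean((s - t) ** 2))

teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(response_distillation_loss(student, teacher))  # small positive KL
print(feature_distillation_loss([0.2, 0.4], [0.25, 0.35]))
```

A training loop would typically add one or both terms, scaled by a weighting factor, to the student's own task loss.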
In this article, we propose DynGKD, a Dynamic Graph representation learning model with Knowledge Distillation. DynGKD extends the Distill2Vec model, making the following contributions:
• We propose different distillation loss functions based on the way that we transfer the knowledge from the teacher model to the student model. We also present a hybrid strategy that combines the teacher's features and responses in the distillation loss function, allowing the student model to distill more information than incorporating only one distillation strategy separately.
• We conduct an extensive experimentation on networks with different characteristics, namely two real-world social networks and three evolving networks generated by live video streaming events. We demonstrate that our hybrid distillation strategy significantly reduces the student model size, while constantly outperforming the teacher model and the baseline strategies in the link weight prediction task.
The remainder of the paper is organized as follows: Sect. 2 reviews the related work, and in Sect. 3 we formulate the problem of knowledge distillation on dynamic graph representation learning. Section 4 describes the proposed DynGKD model, the experimental evaluation is presented in Sect. 5, and we conclude the study in Sect. 6.

Graph representation learning
Static Approaches Graph representation learning has attracted a surge of research in recent years (Wang et al. 2020). Early attempts adopt traditional network embedding techniques, capturing the distribution of the positive node pairs in the latent space (Hamilton et al. 2017b). DeepWalk (Perozzi et al. 2014) performs random walks on the graph and adopts the skip-gram model (Mikolov et al. 2013) to learn network embeddings that maximize the log-likelihood of observed nodes in the walks. Node2Vec (Grover and Leskovec 2016) extends the DeepWalk model by biasing the random walks, so as to capture different properties of the graph. Recently, HARP (Chen et al. 2018a) coarsens related nodes into supernodes to improve the performance of the random walks in DeepWalk and Node2Vec.
With the remarkable performance of neural networks, graph representation learning adopts various architectures to compute accurate node embeddings. Existing approaches employ spectral convolutions on graphs to learn node embeddings with similar structural properties (Kipf and Welling 2017). To alleviate the scalability issues of spectral convolutions on large graphs, GraphSage (Hamilton et al. 2017a) and FastGCN (Chen et al. 2018b) employ neighborhood and importance sampling, respectively. GAT exploits an attention mechanism to transform the node properties into low-dimensional embeddings (Veličković et al. 2018). Recently, DAGNN decouples the convolution operation into two key factors, that is, node property transformation and propagation (Liu et al. 2020). This allows DAGNN to improve the performance of convolution on graphs by adopting deeper neural network architectures than previous approaches. However, such approaches are applied on static graphs, ignoring the dynamic properties of the nodes in evolving graphs.
Dynamic Approaches Dynamic graph representation learning approaches adopt various neural network architectures to model the structural and temporal properties of evolving graphs. Early attempts model the evolving graph as an ordered collection of graph snapshots and extend the static approaches to learn accurate node embeddings (Hamilton et al. 2017b). CTDNE employs temporal random walks on consecutive graph snapshots and applies the skip-gram model to learn the transition probability between two nodes (Nguyen et al. 2018). DNE adopts random walks on each graph snapshot and adjusts the embeddings of the nodes that present significant changes among consecutive graph snapshots (Du et al. 2018). DynGEM employs deep autoencoders to compute the structural properties of each graph snapshot and exploits temporal smoothness methods to ensure stability of the computed node embeddings among consecutive graph snapshots (Goyal et al. 2018). Recently, DynVGAE used variational auto-encoders (Kipf and Welling 2016) to ensure temporal smoothness by introducing weight parameter sharing among consecutive models (Mahdavi et al. 2020). DynamicTriad exploits the triadic closure as a supervision signal to capture the social homophily property in evolving social networks (Zhou et al. 2018). EvolveGCN (Pareja et al. 2020) and DynGraph2Vec (Goyal et al. 2020) adopt recurrent neural networks (RNNs) among consecutive graph convolutional networks (Kipf and Welling 2017) to model the evolution of the graph in the hidden state. Furthermore, DySAT (Sankar et al. 2020) employs graph attention mechanisms to capture the evolution of the graph.
Recent approaches compute the node embeddings by leveraging the time-ordered node interactions of the evolving graph. DeepCoevolve exploits RNNs to define a point process intensity function, which allows the model to compute the influence of an interaction on the node embedding over time (Dai et al. 2016). Jodie (Kumar et al. 2019) employs both an attention mechanism and an RNN to predict the evolution of each node and adapt the generated embeddings accordingly. Moreover, TigeCMN (Zhang et al. 2020) designs coupled memory networks to store and update the node embeddings, while TDIG-MPNN (Chang et al. 2020) updates the embeddings through message passing on the high-order correlated nodes. However, existing approaches require large model sizes to efficiently capture the evolution of the graph. Due to the large model sizes, such approaches have significant online inference latency.

Knowledge distillation
Knowledge distillation has been widely adopted in several machine learning domains, such as image recognition (Ba and Caruana 2014; Hinton et al. 2015), recommender systems (Tang and Wang 2018; Chen et al. 2017), and language translation (Kim and Rush 2016), to generate compact student models with low online inference latency (Bucilua et al. 2006). Knowledge distillation can be divided into three main categories: (1) response-based, (2) feature-based, and (3) relation-based strategies. Response-based strategies distill knowledge to the student model by exploiting the output of the teacher model; they consider the final output of the teacher model as a soft label to regularize the output of the student model (Hinton et al. 2015; Mirzadeh et al. 2020; Kim et al. 2021). By accounting only for the output of the teacher model to distill knowledge, the student model fails to capture the intermediate supervision applied by the teacher model (Gou et al. 2021). Instead, feature-based approaches distill the high-level information acquired by the teacher into the output features. FitNet incorporates the features of the intermediate layers to supervise the student model, minimizing an L2 distillation loss function (Romero et al. 2015). Moreover, Zhou et al. (2018a) consider the parameter sharing of intermediate layers among the teacher and student models. Recently, Chen et al. (2020) formulated a feature embedding task to match the dimensions of the output features generated by the teacher and student models. Although response-based and feature-based strategies distill knowledge from the outputs of specific layers in the teacher model, these approaches ignore the semantic relationship among the different layers. To handle this issue, relation-based approaches explore the relationships between different feature maps. SemCKD follows a cross-layer knowledge distillation strategy to address the different semantics of intermediate layers in the teacher and student models (Chen et al. 2021). In the IRG model, the student model distills knowledge from the Euclidean distance between examples observed by the teacher model (Liu et al. 2019). In Zagoruyko and Komodakis (2017), the authors derive an attention map from the teacher's features to distill knowledge to the student model, while KDA (Qian et al. 2020) employs the Nyström low-rank approximation of the kernel matrix to distill knowledge through landmark points. Nevertheless, such approaches focus on image processing and have not been evaluated on evolving graphs.
A few attempts have been made to follow knowledge distillation strategies that reduce the model size of graph representation learning approaches. DMTKG calculates Heat Kernel Signatures to compute the node features, which are used as inputs to Graph Convolutional Networks (GCNs) to learn accurate node embeddings (Lee and Song 2019). DMTKG generates a compact student model by applying a response-based knowledge distillation strategy based on a weighted cross-entropy distillation loss function. DistillGCN is a feature-based strategy that exploits the output features of the teacher's GCN layers to transfer the structural information of the graph to the student model (Yang et al. 2020). The distillation loss function in DistillGCN minimizes the prediction error of the student model on the online data and the Kullback-Leibler divergence of the features generated by the teacher and student models. However, existing knowledge distillation strategies are designed to transfer knowledge from static graphs, ignoring the evolution of dynamic graphs.

Problem formulation
We model the evolution of a dynamic graph as a collection of graph snapshots over time, defined as follows (Sankar et al. 2020; Pareja et al. 2020; Nguyen et al. 2018; Antaris et al. 2020): In an evolving graph, the node set V_k and edge set E_k vary among consecutive graph snapshots. As illustrated in Fig. 1, node f ∈ V_2 joins the evolving graph in the snapshot G_2, creating a new connection e_2(f, d) ∈ E_2 with node d ∈ V_2. Moreover, in the final snapshot G_K, node b disappears, removing the respective edges. A graph representation learning strategy on evolving graphs is defined as follows (Antaris et al. 2020; Antaris and Rafailidis 2020b; Pareja et al. 2020): The learned node embeddings at timestep k should accurately capture the structural and temporal evolution of the graph, up to the k-th timestep.
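The snapshot view of an evolving graph can be made concrete with a small sketch. The snapshot contents below simply mirror the example of Fig. 1 (node f joining, node b disappearing); the data layout and the `delta` helper are our own illustrative choices, not part of the paper.

```python
# An evolving graph as an ordered collection of snapshots, mirroring Fig. 1:
# node f joins in G2 with edge (f, d); node b disappears in the final snapshot.
snapshots = [
    {"nodes": {"a", "b", "c", "d"},
     "edges": {("a", "b"), ("b", "c"), ("c", "d")}},             # G1
    {"nodes": {"a", "b", "c", "d", "f"},
     "edges": {("a", "b"), ("b", "c"), ("c", "d"), ("f", "d")}}, # G2: f joins
    {"nodes": {"a", "c", "d", "f"},
     "edges": {("c", "d"), ("f", "d")}},                         # GK: b removed
]

def delta(g_prev, g_next):
    # Node/edge churn between two consecutive snapshots.
    return {"nodes_added": g_next["nodes"] - g_prev["nodes"],
            "nodes_removed": g_prev["nodes"] - g_next["nodes"],
            "edges_added": g_next["edges"] - g_prev["edges"],
            "edges_removed": g_prev["edges"] - g_next["edges"]}

print(delta(snapshots[0], snapshots[1]))  # f joins with edge (f, d)
print(delta(snapshots[1], snapshots[2]))  # b disappears with its edges
```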
Our study focuses on knowledge distillation strategies for dynamic graph representation learning approaches, as defined in the following (Antaris and Rafailidis 2020a).

Method overview
As illustrated in Fig. 2, the DynGKD model consists of the teacher and student models. The goal of the proposed DynGKD model is to train a large teacher model as an offline process and distill the knowledge of the teacher model to a small student model during training on the online data.
Teacher Model The teacher model takes as input the m offline graph snapshots. The role of the teacher model is to learn node embeddings that capture both the structural and temporal evolution of the m offline graph snapshots. Following Sankar et al. (2020), we adopt two self-attention layers to capture the structural and temporal evolution of the m graph snapshots.

Teacher model on the offline data
The teacher model DynGKD-T learns the teacher node embeddings based on l consecutive graph snapshots. Provided that we pretrain the teacher model on the offline graph snapshots, we consider all the m snapshots during training (l = m), with m ≪ K. Following Sankar et al. (2020), we capture the evolution of the graph by employing two self-attention layers. The structural attention layer captures the structural properties of each graph snapshot, and the temporal attention layer learns the complex temporal properties of the evolving graph.
Structural Attention Layer Given the offline graph snapshots G^T = {G^T_1, …, G^T_m}, the input of the structural attention layer is the m feature vectors {x_1, …, x_m}. The structural attention layer computes l structural node representations H(u) ∈ ℝ^{l×d}, with l = m, of each node u ∈ V by implementing a multi-head self-attention mechanism as follows (Veličković et al. 2018; Vaswani et al. 2017):

H_k(u) = ‖_{i=1}^{g} ELU( Σ_{v ∈ N_k(u)} a^i_k(u, v) W^i x_k(v) )   (1)

where g is the number of attention heads, N_k(u) is the neighborhood set of the node u ∈ V_k at the graph snapshot G^T_k, W ∈ ℝ^{d×c} is the weight transformation matrix of the c-dimensional node features x_k(u), and ELU is the exponential linear unit activation function. Variable a_k(u, v) is the attention coefficient between the nodes u ∈ V_k and v ∈ V_k, defined as follows:

a_k(u, v) = exp( w^⊤ [ W x_k(u) ‖ W x_k(v) ] ) / Σ_{v' ∈ N_k(u)} exp( w^⊤ [ W x_k(u) ‖ W x_k(v') ] )   (2)

where ‖ is the concatenation operation to aggregate the transformed feature vectors of the nodes u ∈ V_k and v ∈ V_k. Variable w^⊤ is a 2d-dimensional weight vector parameterizing the aggregated feature vectors. The attention weight a_k(u, v) expresses the impact of the node v on the node u at the k-th timestep, when compared with the neighborhood set N_k(u) (Vaswani et al. 2017).
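A single-head structural attention step of this kind can be sketched in NumPy. This is a minimal illustration of a GAT-style layer in the spirit of Eq. 1, not the paper's implementation: the function names, the LeakyReLU slope, and the toy dimensions are our own assumptions.

```python
import numpy as np

def elu(x):
    # Exponential linear unit activation.
    return np.where(x > 0, x, np.exp(x) - 1.0)

def structural_attention(X, neighbors, W, a):
    """Single-head GAT-style structural attention (illustrative names).
    X: (n, c) node features; W: (d, c) transform; a: (2d,) attention vector;
    neighbors: dict node index -> list of neighbor indices (including self)."""
    H = X @ W.T                       # (n, d) transformed features W x(u)
    n, d = H.shape
    out = np.zeros_like(H)
    for u in range(n):
        nbrs = neighbors[u]
        # score(u, v) = a^T [W x(u) || W x(v)] for each neighbor v
        scores = np.array([a @ np.concatenate([H[u], H[v]]) for v in nbrs])
        scores = np.where(scores > 0, scores, 0.2 * scores)  # LeakyReLU
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()          # attention coefficients a(u, v)
        out[u] = elu(sum(w * H[v] for w, v in zip(alpha, nbrs)))
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))           # 4 nodes with 3-dimensional features
W = rng.normal(size=(2, 3))           # project to d = 2
a = rng.normal(size=4)                # 2d-dimensional attention vector
nbrs = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
Z = structural_attention(X, nbrs, W, a)
print(Z.shape)  # (4, 2)
```

A multi-head version would run g such heads with independent parameters and concatenate their outputs per node.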
Temporal Attention Layer The input of the temporal attention layer is the m structural node embeddings {H_1(u), …, H_m(u)} learned by the structural attention layer. The temporal attention layer aims to capture the evolution of the graph over time. Therefore, we design the multi-head scaled dot-product form of attention to learn the m temporal representations Z(u) ∈ ℝ^{m×d} of each node u ∈ V_k, as follows (Sankar et al. 2020; Vaswani et al. 2017):

Z(u) = ‖_{i=1}^{p} β^i(u) ( H(u) W^i_value )   (3)

where p is the number of attention heads, and W_value ∈ ℝ^{d×d} is the weight parameter matrix to transform the structural embeddings H(u). The attention weight matrix β(u) ∈ ℝ^{l×l}, with l = m for the teacher model, is calculated as follows:

β(u) = softmax( ( H(u) W_query ) ( H(u) W_key )^⊤ / √d )   (4)

where W_key ∈ ℝ^{d×d} and W_query ∈ ℝ^{d×d} are the weight parameter matrices of the l structural node embeddings H(u). The attention weight matrix β(u) indicates the differences between the structural node embeddings over the l graph snapshots (Vaswani et al. 2017).
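The scaled dot-product attention over a node's snapshot embeddings can be sketched as follows. This is a single-head illustration with our own variable names and toy dimensions, not the paper's code.

```python
import numpy as np

def temporal_attention(H, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one node's l snapshot
    embeddings (illustrative names). H: (l, d) structural embeddings of a
    single node across l snapshots; Wq, Wk, Wv: (d, d) weight matrices."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv           # (l, d) each
    d = H.shape[1]
    scores = Q @ K.T / np.sqrt(d)               # (l, l) snapshot-to-snapshot
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    beta = np.exp(scores)
    beta /= beta.sum(axis=1, keepdims=True)     # row-wise softmax
    return beta @ V                             # (l, d) temporal embeddings

rng = np.random.default_rng(1)
l, d = 5, 4
H = rng.normal(size=(l, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Z = temporal_attention(H, Wq, Wk, Wv)
print(Z.shape)  # (5, 4)
```

In a strictly causal setting, one would additionally mask the scores so that each snapshot attends only to earlier snapshots; the unmasked form above is kept minimal.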
Having learned the l structural node embeddings H(u) and temporal node embeddings Z(u) of the node u, we calculate the final node embedding e^T_l(u) of the teacher model at the l-th timestep as follows:

e^T_l(u) = Z_l(u)   (5)

To train the teacher model, we formulate the Root Mean Squared Error (RMSE) loss function with respect to the node embeddings:

L^T_l = √( 1/|E_l| Σ_{(u,v) ∈ E_l} ( σ( e^T_l(u)^⊤ e^T_l(v) ) − A_l(u, v) )² )   (6)

where σ is the sigmoid activation function, A_l(u, v) is the observed weight of the edge (u, v) ∈ E_l, and the term σ(e^T_l(u)^⊤ e^T_l(v)) − A_l(u, v) is the reconstruction error of the neighborhood N_l(u) of the node u ∈ V_l at the l-th timestep. We optimize the weight parameter matrices in both the structural and temporal attention layers based on the loss function in Eq. 6 and the backpropagation algorithm with the Adam optimizer (Kingma and Ba 2015).
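The RMSE reconstruction objective of Eq. 6 can be computed as below. The function and variable names are illustrative, and the edge weights are assumed to be scaled to (0, 1) so that they are comparable with the sigmoid of the embedding inner product.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_rmse(Z, edges, weights):
    """RMSE reconstruction loss over observed edges (a sketch of Eq. 6).
    Z: (n, d) final node embeddings; edges: list of (u, v) index pairs;
    weights: observed edge weights, assumed scaled to (0, 1)."""
    preds = np.array([sigmoid(Z[u] @ Z[v]) for u, v in edges])
    return float(np.sqrt(np.mean((preds - np.asarray(weights)) ** 2)))

rng = np.random.default_rng(2)
Z = rng.normal(size=(4, 3))          # toy embeddings for 4 nodes
edges = [(0, 1), (1, 2), (2, 3)]
weights = [0.9, 0.4, 0.7]
print(reconstruction_rmse(Z, edges, weights))
```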

Knowledge distillation strategies
The student model learns the student node embeddings based on the online graph snapshots G^S. Provided that the student model distills knowledge from the pretrained teacher model DynGKD-T, the student model requires a significantly lower number of model parameters. At each timestep k = m+1, …, K, the student model considers l consecutive graph snapshots {G_{k−l}, …, G_k}, with l ≪ K. We compute the l structural and temporal node embeddings based on Eqs. 1 and 3, respectively. The final node embeddings are computed according to Eq. 5.
To transfer the knowledge from the teacher model, the student model formulates a distillation loss function.
Response-Based Distillation Strategy (DynGKD-R) The distillation loss function of the response-based strategy minimizes the difference between the predicted values of the teacher and student models (Eq. 7).
Feature-Based Distillation Strategies (DynGKD-F1/DynGKD-F2) As aforementioned in Sect. 4.2, the teacher model computes node embeddings that preserve both the structural and temporal evolution of the graph. We exploit the teacher node embeddings to supervise the training of the student model through the distillation loss function of the feature-based strategy DynGKD-F1 in Eq. 8, which contains the Kullback-Leibler (KL) divergence between the student and teacher node embeddings. The L^F1_k term prevents the student node embeddings from deviating significantly from the teacher node embeddings. Therefore, by optimizing Eq. 8, the student model generates node embeddings that match the embedding space of the teacher model. Instead, the distillation loss function of the response-based strategy DynGKD-R in Eq. 7 allows the student model to predict similar values to the teacher model, regardless of the differences between the generated node embeddings.
To generate the final node embeddings, the teacher model DynGKD-T computes intermediate structural and temporal embeddings based on Eqs. 1 and 3, respectively. Following Romero et al. (2015), in the second examined feature-based distillation strategy DynGKD-F2, we adopt the intermediate embeddings as a supervision signal to improve the training of the student model. We formulate the distillation loss function in DynGKD-F2 (Eq. 9) to incorporate both the structural and temporal embeddings of the teacher model, through the KL divergence between the intermediate embeddings of the teacher and student models. Note that the feature-based knowledge distillation of DynGKD-F1 in Eq. 8 allows the student model to mimic only the final embeddings of the teacher model. In contrast, the distillation loss function of DynGKD-F2 in Eq. 9 transfers knowledge by matching both the structural and temporal embedding spaces of the teacher and student models.
Hybrid Distillation Strategy (DynGKD-H) The hybrid strategy combines the teacher's responses and intermediate features in a single distillation loss function (Eq. 10), where the term L^T_k corresponds to the prediction error of the teacher model based on Eq. 6, and L^F2_k is the KL divergence of the intermediate embeddings generated by the teacher and student models, according to Eq. 9. In contrast to the previously examined distillation strategies, the hybrid loss function in Eq. 10 allows the student model to distill knowledge from different outputs of the teacher model at the same time. This means that the L^T_k loss in Eq. 10 allows the student model to achieve similar prediction accuracy to the teacher model. In addition, the structural and temporal embeddings of the student model reflect the structural and temporal embedding space of the teacher model. As we will demonstrate in Sect. 5.4, this hybrid strategy allows the student model to learn more accurate node embeddings, achieving high prediction accuracy, while significantly reducing the number of required parameters.
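A hybrid objective of this kind, combining a prediction-error term with KL terms that match the teacher's structural and temporal embedding spaces, can be sketched as below. The function names, the row-normalization used to turn embeddings into distributions, and the weighting factor `lam` are our own illustrative assumptions, not the exact formulation of Eq. 10.

```python
import numpy as np

def kl_div(P, Q, eps=1e-9):
    # KL divergence between row-normalized (non-negative) embedding matrices.
    P = np.abs(P) + eps
    Q = np.abs(Q) + eps
    P = P / P.sum(axis=1, keepdims=True)
    Q = Q / Q.sum(axis=1, keepdims=True)
    return float(np.sum(P * np.log(P / Q)))

def hybrid_loss(student_pred, target, H_s, H_t, Z_s, Z_t, lam=0.5):
    """Hybrid distillation objective in the spirit of Eq. 10 (names ours).
    Combines the student's prediction error with KL terms matching the
    teacher's structural (H) and temporal (Z) embedding spaces."""
    pred = np.asarray(student_pred)
    tgt = np.asarray(target)
    pred_err = np.sqrt(np.mean((pred - tgt) ** 2))        # RMSE term
    feat_err = kl_div(H_s, H_t) + kl_div(Z_s, Z_t)        # feature KL terms
    return float(pred_err + lam * feat_err)

rng = np.random.default_rng(3)
H_t, Z_t = rng.random((4, 3)), rng.random((4, 3))          # teacher embeddings
H_s = H_t + 0.05 * rng.random((4, 3))                      # nearby student
Z_s = Z_t + 0.05 * rng.random((4, 3))
print(hybrid_loss([0.8, 0.3], [0.9, 0.25], H_s, H_t, Z_s, Z_t))
```

The `lam` factor plays the role of the distillation weighting parameter whose influence is studied in Sect. 5.5.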

Datasets
To evaluate the performance of the examined models on networks with different characteristics, we use two datasets of social networks and three datasets of live video streaming events. In Table 1, we summarize the datasets' statistics.
Social networks We consider two bipartite networks from Yelp and MovieLens (ML-10M). Each graph snapshot in the Yelp dataset corresponds to the ratings of users to businesses within a 6-month period. In ML-10M, each graph snapshot consists of the user ratings to movies within a 1-year period.
Video streaming networks We consider three video streaming datasets, which correspond to the connections of the viewers in real live video streaming events in enterprise networks (Antaris et al. 2020). The duration of each event is 80 minutes. As aforementioned in Sect. 1, each viewer establishes a limited number of connections, so as to distribute the video content with other viewers, using the offices' internal high-bandwidth network. To efficiently identify the high-bandwidth connections, each viewer periodically adapts the connections. We monitor the established connections during a live video streaming event and model the viewers' interactions as an undirected weighted dynamic graph. The weight of a graph edge corresponds to the throughput measured between two nodes/viewers. Each dataset consists of 8 discrete graph snapshots, where each graph snapshot corresponds to the viewers' interactions within a 10-minute period.

Evaluation protocol
We evaluate the performance of our proposed model on the link weight prediction task on the online graph snapshots G^S. For the social networks, we consider the first 5 timesteps as the offline data and the remaining timesteps as the online data, that is, 11 and 9 online graph snapshots for the Yelp and ML-10M datasets, respectively. For each dataset of the video streaming networks, we consider the first 3 graph snapshots as the offline data, and the remaining 5 correspond to the online data.
The task of link weight prediction is to predict the weight of the unobserved edges U_{k+1} = E_{k+1} \ {E_1, …, E_k} at the (k+1)-th timestep, given the node embeddings generated by the student model at the k-th timestep. Following Antaris et al. (2020), we combine the node embeddings of the nodes u and v, for each connection (u, v) ∈ E_k, based on the Hadamard operator, and train a Multi-Layer Perceptron (MLP) model using negative sampling. Having trained the MLP model, we input the combined node embeddings for the unobserved edges U_{k+1} to calculate the predicted weights.
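The edge-scoring step of this protocol can be sketched as follows. The Hadamard (element-wise) product combines the two endpoint embeddings into one edge feature vector, which is then fed to an MLP; the weights below are random (untrained) and the names and dimensions are our own illustrative choices.

```python
import numpy as np

def edge_features(Z, pairs):
    # Combine the endpoint embeddings of each edge with the Hadamard
    # (element-wise) product; the result is the MLP input for that edge.
    return np.array([Z[u] * Z[v] for u, v in pairs])

rng = np.random.default_rng(4)
Z = rng.normal(size=(5, 8))              # student embeddings at timestep k
unobserved = [(0, 3), (1, 4), (2, 3)]    # edges to score at timestep k + 1
F = edge_features(Z, unobserved)
print(F.shape)  # (3, 8): one feature vector per unobserved edge

# Forward pass of a one-hidden-layer MLP scorer (untrained, illustrative).
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(4, 1))
relu = lambda x: np.maximum(x, 0)
pred = relu(F @ W1) @ W2
print(pred.shape)  # (3, 1): one predicted weight per unobserved edge
```

In practice the MLP would be trained on the observed edges (with negative sampling) before scoring the unobserved ones.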
We measure the prediction accuracy of each examined model based on the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) metrics, defined as follows:

RMSE = √( 1/|U| Σ_{(u,v) ∈ U} ( ŵ(u, v) − w(u, v) )² ),   MAE = 1/|U| Σ_{(u,v) ∈ U} | ŵ(u, v) − w(u, v) |

where w(u, v) and ŵ(u, v) are the observed and predicted weights of the edge (u, v) in the test set U. Note that the RMSE metric emphasizes large prediction errors more than the MAE metric does. Following Pareja et al. (2020), Sankar et al. (2020) and Antaris et al. (2020), to train the model at each timestep k, we randomly select 20% of the unobserved links as the validation set. The remaining 80% of the unobserved links are used as the test set. We repeat each experiment 5 times and report the average RMSE and MAE over the five trials.
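The two metrics, and the way RMSE amplifies large errors relative to MAE, can be checked with a small example:

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_pred - y_true)))

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 6.0]             # one large error of 3
print(rmse(y_true, y_pred))  # sqrt(3) ~ 1.732: RMSE amplifies the outlier
print(mae(y_true, y_pred))   # 1.0
```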

Examined models
We compare the performance of our proposed model with the following examined models:
• DynVGAE (Mahdavi et al. 2020) is a dynamic joint variational auto-encoder architecture that shares parameters over consecutive graph auto-encoders. Given that there is no publicly available implementation, we implemented DynVGAE from scratch and publish our code.
• KDA (Qian et al. 2020) is a model-agnostic relation-based distillation strategy, based on the Nyström low-rank approximation (Williams and Seeger 2001). For reproducibility purposes, we publish our implementation, as there is no publicly available implementation of KDA.
• DynGKD-T is the teacher model of the proposed approach, presented in Sect. 4.2.
• DynGKD-R is the student model that distills knowledge from the DynGKD-T model based on the response-based distillation strategy in Eq. 7.
• DynGKD-F1 is the student model that employs the feature-based distillation loss function in Eq. 8 to transfer knowledge from the teacher model.
• DynGKD-F2 is the student model that mimics the intermediate node embeddings of the teacher model through the feature-based loss function in Eq. 9.
• DynGKD-H is the proposed hybrid distillation strategy that transfers knowledge from the teacher model through the loss function in Eq. 10.
For all the examined models, we tuned the hyper-parameters based on a grid selection strategy on the validation set and report the results with the best configuration. The operating system of the server was Ubuntu 18.04.5 LTS, and we implemented the DynGKD model using PyTorch 1.7.1. In Sect. 5.5, we study the influence of the knowledge distillation parameter on the performance of each student model.

Performance evaluation
In Fig. 3, we evaluate the performance of the proposed hybrid student model DynGKD-H against the non-distillation strategies, in terms of RMSE and MAE. We observe that the proposed hybrid student model constantly outperforms the baseline approaches in all datasets. This suggests that DynGKD-H can efficiently learn node embeddings that capture the evolution of the online graph snapshots. Compared with the second-best method DyREP, the proposed DynGKD-H model achieves high link weight prediction accuracy, with average relative drops of 38.32 and 88.68% in terms of RMSE and MAE in Yelp, and 15.77 and 30.07% in the ML-10M dataset. For the video streaming datasets, the DynGKD-H model achieves average relative drops of 26.43 and 59.33% in LiveStream-4K, 33.44 and 22.89% in LiveStream-6K, and 89.35 and 19.17% in LiveStream-16K. The DyREP model outperforms the other baseline approaches, as the learned node embeddings capture the temporal changes of each node's neighborhood between consecutive graph snapshots. Instead, the other baseline approaches focus only on the structural changes of the graphs, ignoring the evolution of the node interactions over time. However, the DyREP model is designed to identify historical patterns in the node connections over time (Trivedi et al. 2019). Therefore, DyREP has low prediction accuracy on dynamic graphs that are updated significantly over time (Liu et al. 2020). The proposed DynGKD-H model overcomes this problem by capturing both the structural and temporal evolution of the graph using two self-attention layers. The structural attention layer contains the information of the structural properties of each graph snapshot, while the temporal attention layer captures the evolution of the generated structural embeddings over time. Therefore, the learned node embeddings of the DynGKD-H model reflect the temporal variations of the graph. In Fig.
4, we demonstrate the performance of the examined knowledge distillation strategies in terms of RMSE and MAE. We compare the proposed DynGKD model against DMTKG (Ma and Mei 2019), a baseline approach which employs a response-based distillation strategy on graph neural networks. Moreover, we evaluate our proposed model against KDA (Qian et al. 2020), a model-agnostic relation-based strategy. Given that we exploit the proposed DynGKD model, presented in Sect. 4, as the teacher model KDA-T, we omit the results of KDA-T. On inspection of Fig. 4, we observe that the hybrid student model DynGKD-H constantly outperforms its variants in all datasets. Note that DynGKD-H exploits the hybrid distillation strategy in Eq. 10 to transfer different types of information from the teacher model. Therefore, the DynGKD-H student model mimics both the prediction accuracy and the generated structural and temporal node embeddings of the teacher. Instead, the response-based DynGKD-R and the feature-based DynGKD-F1 and DynGKD-F2 approaches exploit only a single source of information from the teacher. This means that the DynGKD-H strategy distills rich information from the teacher model, which allows the student model to capture the evolution of the graph more efficiently than its variants. We also observe that all the examined variants of the DynGKD student model constantly outperform the DynGKD-T teacher model in all datasets. This indicates that the generated student models overcome any bias introduced by the teacher model DynGKD-T during training on the offline data. In addition, the proposed hybrid student model DynGKD-H achieves higher prediction accuracy than the DMTKG-T and DMTKG-S models in all datasets. This occurs because the DMTKG model is designed to learn node embeddings on static graphs, ignoring the evolution of the graph. Compared with the relation-based strategy KDA-S, our proposed DynGKD-H model achieves superior performance in all datasets. KDA employs the Nyström low-rank
approximation to distill knowledge through the selected landmark points, ignoring the temporal evolution of the nodes. In Tables 2, 3 and 4, we present the number of required parameters in millions to train each examined distillation strategy. Moreover, we report the online inference time in seconds to compute the node embeddings. As aforementioned in Sect. 4, the teacher models DynGKD-T, DMTKG-T and KDA-T learn the node embeddings based on the offline graph snapshots G^T. Therefore, the DynGKD-T, DMTKG-T and KDA-T models have the same number of required parameters and inference times, when evaluated on the online graph snapshots G^S. As aforementioned in Sect. 5.3, KDA-T adopts the DynGKD model, thus requiring the same number of parameters as the DynGKD-T model. Moreover, the feature-based strategies DynGKD-F1 and DynGKD-F2 and the hybrid strategy DynGKD-H require the same number of parameters to train on the online graph snapshots. This occurs because the distillation loss functions in Eq.
8-10 exploit the features of the teacher to transfer knowledge to the student model. Therefore, the student model computes node embeddings that lie in the same embedding space as the teacher embeddings. Instead, in DynGKD-R the student model distills knowledge from the predicted values of the teacher model, ignoring the information captured by the generated embeddings, thus having a different number of required parameters. In addition, the relation-based KDA-S strategy achieves high prediction accuracy when setting the same number of parameters as DynGKD-F1, DynGKD-F2 and DynGKD-H, thus achieving similar inference time. On inspection of Tables 2, 3 and 4, we observe that the feature-based distillation strategies DynGKD-F1 and DynGKD-F2, and the hybrid strategy DynGKD-H require a significantly lower number of parameters than the response-based approach DynGKD-R. This occurs because the generated model in the DynGKD-R strategy requires a high number of attention heads to capture the multiple facets of the evolving graph. In addition, we calculate the compression ratio of the DynGKD-H strategy as the number of parameters of the student model relative to the size of the teacher model. The DynGKD-H student model achieves average compression ratios of 21:100 and 69:100 for the Yelp and ML-10M datasets, when compared with the required parameters of the teacher model DynGKD-T. For the live video streaming datasets, the average compression ratios are 13:100 in LiveStream-4K, 14:100 in LiveStream-6K, and 32:100 in LiveStream-16K. Provided that the proposed hybrid student model DynGKD-H reduces the model size, we demonstrate the impact of the number of parameters on the online inference time of the node embeddings. We observe that DynGKD-H achieves a speed-up factor of ×17 on average for the inference latency in the Yelp dataset, when compared with the DynGKD-T teacher model. DynGKD-H achieves a speed-up factor of ×25 for the inference latency in the ML-10M dataset,
×42 in LiveStream-4K, ×23 in LiveStream-6K, and ×36 in LiveStream-16K. On inspection of Tables 2, 3 and 4 and Fig. 4, we find that the feature-based strategies and the hybrid strategy require the same number of parameters during training. However, DynGKD-H constantly outperforms the feature-based strategies in terms of RMSE. This occurs because DynGKD-H exploits multiple types of information from the teacher model, which allows the student model to overcome any bias captured in the learned node embeddings of the teacher model. Therefore, the proposed DynGKD-H student model significantly reduces the model size and the online inference latency, while achieving high prediction accuracy.
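The evaluation quantities used throughout this section (relative error drop, compression ratio, and speed-up factor) follow directly from their definitions. A minimal sketch, with function names of our own choosing rather than from the paper:

```python
def relative_drop(model_error, baseline_error):
    """Relative error reduction of a model against a baseline, in percent."""
    return 100.0 * (baseline_error - model_error) / baseline_error

def compression_ratio(student_params, teacher_params):
    """Student-to-teacher parameter ratio; 0.21 corresponds to 21:100."""
    return student_params / teacher_params

def speedup(teacher_latency, student_latency):
    """Inference speed-up factor of the student over the teacher."""
    return teacher_latency / student_latency
```

For instance, a student with 21M parameters distilled from a 100M-parameter teacher yields a compression ratio of 21:100, and halving a baseline's error corresponds to a 50% relative drop.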

Impact of knowledge distillation
In Fig. 5, we present the impact of the hyper-parameter λ (Eqs. 7-10) on the performance of each knowledge distillation strategy. We vary λ from 0.1 to 0.9 with a step of 0.1, to study the influence of the teacher model DynGKD-T on the different distillation loss functions L_D during the online training process of the student model. For each value of λ, we report the average RMSE of each knowledge distillation strategy over all the online graph snapshots G^S. We observe that all strategies achieve the best prediction accuracy when λ is set to 0.4 for the Yelp and ML-10M datasets, and 0.3 in LiveStream-4K, LiveStream-6K and LiveStream-16K. For low values of λ, the distillation strategies emphasize the knowledge of the teacher model more than that of the student model. Therefore, the student model is prevented from any additional training on the online data, which negatively impacts the ability of the learned embeddings to capture the evolution of the graph. Instead, increasing the value of λ beyond the best setting degrades the performance of the distillation strategies. This occurs because the student model is trained with low supervision from the teacher model. Additionally, in Fig. 5, we study the influence of the hyper-parameter λ on the online inference latency of the knowledge distillation strategies. For each value of λ, we report the average inference latency in seconds over all the online graph snapshots. Given that DynGKD-F1, DynGKD-F2 and DynGKD-H exploit the teacher embeddings

Conclusions
In this article, we presented the DynGKD model, a knowledge distillation strategy on neural networks for evolving graphs. The proposed DynGKD model generates a compact student model which efficiently captures the evolution of the online graph snapshots in the node embeddings, while achieving low online inference latency. We introduced three distillation loss functions to transfer different types of information from the teacher model to the student model. Moreover, we proposed a hybrid distillation loss function that transfers both the predicted values and the embeddings of the teacher model to the student model. This allows the student model to distill different information from the teacher model and capture the evolution of the online graph snapshots, while requiring small model sizes. Evaluated against several baseline approaches on five real-world datasets, we demonstrated the efficiency of our model in reducing the online inference latency, while achieving high prediction accuracy. Our experiments showed that the proposed hybrid student model DynGKD-H achieves 40.60 and 44.02% average relative drops in terms of RMSE and MAE over all datasets, respectively. Moreover, the proposed hybrid student model achieves an average compression ratio of 21:100, when compared with the teacher model, which corresponds to a speed-up factor of ×30 on average for the online inference time. An interesting future direction is to explore adaptive aggregation strategies for the generation of the structural and temporal embeddings in Eq. 5, aiming to capture various properties of the graph evolution (Hamilton et al. 2017a; Wang et al. 2020). Accounting for the dynamic nature of evolving graphs, we also plan to investigate relation-based approaches with online distillation strategies, where both the teacher and the student models capture the semantics of the data examples simultaneously (Guo et al. 2020; Mirzadeh et al. 2020; Sun et al.
2021).This means that the teacher model will not only compute the structural and temporal properties of an evolving graph, but will also assist the student model with updated information.

Definition 3
Knowledge Distillation on Dynamic Graph Representation Learning. The goal of knowledge distillation is to generate a compact student model S, which distills the knowledge acquired by a large teacher model T. The teacher model T learns the node embeddings based on the first m graph snapshots G^T = {G^T_1, …, G^T_m}, with 1 < m < K, which correspond to the offline data. Having pretrained the teacher model T, the student model S computes the node embeddings on the online data G^S = {G^S_{m+1}, …, G^S_K}. The student model S distills the knowledge from the teacher T through a distillation loss function L_D.
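Definition 3 splits the snapshot sequence at an index m into the teacher's offline data and the student's online data. A minimal sketch of this split (the function name is ours, not from the paper):

```python
def split_snapshots(snapshots, m):
    """Split K graph snapshots into offline data G^T (the first m
    snapshots, used to pretrain the teacher) and online data G^S
    (the remaining K - m snapshots, used to train the student)."""
    if not 1 < m < len(snapshots):
        raise ValueError("m must satisfy 1 < m < K")
    return snapshots[:m], snapshots[m:]
```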

Fig. 1
Fig. 1 Overview of an evolving graph over time. We denote with dotted lines the new nodes/edges of each graph snapshot

Fig. 2
Fig. 2 Overview of the DynGKD model, given a sequence of discrete graph snapshots G. The teacher model is trained as an offline process based on the offline graph snapshots G^T. Then, the student model is trained on the online graph snapshots G^S and distills knowledge from the teacher model through the distillation loss function L_D

during the online training. As aforementioned in Sect. 1, the knowledge distillation strategies can be categorized as response-based and feature-based, according to the type of information transferred from the teacher model (Gou et al. 2021). To evaluate the impact of each distillation strategy on the learned embeddings of the student model, we formulate one response-based and two feature-based distillation loss functions. Moreover, we propose a hybrid distillation loss function that exploits both the responses and the features of the teacher model during the online training process of the student model.

Response-Based Distillation Strategy (DynGKD-R). In the response-based distillation strategy DynGKD-R, we focus on the prediction accuracy of the teacher model on the online data. The student model learns the node embeddings at the k-th timestep by minimizing the following distillation loss function (Antaris et al. 2020): L_D = λ L^S_k + (1 − λ) L^T_k, where L^S_k is the root mean squared error of the student model on the online data, and L^T_k is the prediction error of the teacher model DynGKD-T in Eq. 6 on the online data. The hyper-parameter λ ∈ [0, 1] controls the tradeoff between the two losses. High values steer the training of the node representations towards the errors of the student model, ignoring the knowledge of the teacher model. The distillation loss function L_D of DynGKD-R allows the student model to mimic the reconstruction errors of the teacher model based on Eq. 6, while significantly reducing the model size (Antaris et al. 2020).
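The response-based loss described above can be sketched as a convex combination of the student and teacher errors. We assume the weighting L_D = λ·L_S + (1 − λ)·L_T, consistent with high λ favoring the student's own error; the exact form of Eq. 7 is not reproduced in this excerpt:

```python
def response_distillation_loss(student_loss, teacher_loss, lam):
    """Response-based distillation loss, assumed here as
    L_D = lam * L_S + (1 - lam) * L_T.

    High lam values steer training towards the student's own error,
    ignoring the knowledge of the teacher model."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * student_loss + (1.0 - lam) * teacher_loss
```

At lam = 1.0 the teacher's error is ignored entirely, which matches the observation that overly high λ values leave the student with too little supervision from the teacher.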

the structural and temporal node embeddings of the student model are similar to the intermediate node embeddings of the teacher model.

Hybrid Distillation Strategy (DynGKD-H). Although the above strategies provide favorable information for the training of the student model, they focus on transferring an individual type of knowledge from the teacher to the student. In particular, the student model is trained to mimic either the predicted values or the learned embeddings of the teacher model. To train a student model that distills knowledge from both the responses and the embeddings of the teacher model, we formulate a hybrid strategy DynGKD-H based on the distillation loss function in Eq. 10, which combines the response-based loss with the feature-based losses.

In particular, in DynVGAE and DynamicTriad, we set the embedding size to d = 256 with window size l = 2 in all datasets. In TDGNN and DyREP, the embedding size is set to d = 128, with l = 3 previous graph snapshots. EvolveGCN uses a d = 32 node embedding dimension and l = 2 window size. In DMTKG-T, the embedding size is fixed to d = 512 in all datasets. In DMTKG-S, we reduce the embedding size to d = 32. In KDA-T and our teacher model DynGKD-T, we set the embedding size to d = 128, with l = 5 consecutive graph snapshots, and the number of attention heads is set to 16 in both Eqs.
1 and 3.In the feature-based student models Dyn-GKD-F1 and DynGKD-F2 , the hybrid model DynGKD-H , and the relation-based student model KDA-S , we reduce the number of head attentions to 2 and the window size l = 3 , while the node embedding dimension is set to d = 128 for all datasets.The student model DynGKD-R achieves the best performance when apply 8 head attentions, with d = 64 node embedding size and l = 2 previous graph snapshots.As aforementioned in Sect.4.3, the response-based distillation strategy DynGKD-R focuses on the prediction error of the teacher model, ignoring the information contained in the generated feature embeddings.Therefore, the student model DynGKD-R requires higher number of attention heads than the feature-based models, to capture the multiple latent facets of the online graph snapshots.Instead, the featurebased approaches distill information from the teacher output embeddings, which contain the information of the different latent facets.Therefore, the feature-based approaches Dyn-GKD-F1 , DynGKD-F2 and DynGKD-H require 2 attention heads during training on the online data.All experiments were executed on a single server with a CPU Intel Xeon Bronze 3106, 1.70GHz and a GPU Geforce RTX 2080 Ti.

Fig. 3
Fig. 3 Performance comparison of the proposed hybrid student model DynGKD-H against the non-distillation strategies in terms of RMSE and MAE

Fig. 4
Fig. 4 Comparison of the examined teacher and student models in terms of RMSE and MAE

to transfer knowledge to the student model, the generated student models learn node embeddings that are similar to the embeddings of the teacher model. Therefore, the student models in DynGKD-F1, DynGKD-F2 and DynGKD-H require the same number of parameters during training. We omit the DynGKD-F1 and DynGKD-F2 strategies, as the student models have the same inference latency as DynGKD-H. On inspection of Fig. 5, we observe that the best values of λ, in terms of online inference latency, are 0.4 in Yelp and ML-10M, and 0.3 in LiveStream-4K, LiveStream-6K and LiveStream-16K. Increasing the value of λ negatively impacts the online inference latency of the student model. This occurs because the student model distills less knowledge from the teacher model. As a consequence, the student model requires a large number of parameters to capture the evolution of the graph. This observation reflects the importance of balancing the impact of the teacher model on the training process of the student model.
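The sweep over λ described above is a one-dimensional grid search. A toy sketch (the evaluation function below is a synthetic stand-in for training a student model and averaging its RMSE over the online snapshots; both function names are ours):

```python
def best_lambda(evaluate):
    """Grid search over lambda in {0.1, ..., 0.9}, returning the value
    with the lowest average validation RMSE."""
    grid = [round(0.1 * i, 1) for i in range(1, 10)]
    return min(grid, key=evaluate)

def toy_rmse(lam):
    """Synthetic error surface with its minimum at lam = 0.4,
    mimicking the behavior observed on Yelp and ML-10M."""
    return (lam - 0.4) ** 2 + 1.0
```

On this toy surface, best_lambda(toy_rmse) selects 0.4, mirroring how the best λ is picked per dataset in the experiments.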

Fig. 5
Fig. 5 Impact of parameter λ on the prediction accuracy and online inference latency of the examined DynGKD student models. We omit the DynGKD-F1 and DynGKD-F2 strategies, as they have the same model size as DynGKD-H for the online inference latency experiments

Table 4
Inference time in seconds and #Parameters in millions of the examined distillation strategies for the LiveStream datasets